68 Feature Engineering for Text

Text is the most abundant and least structured data modality in modern machine learning. A document is a sequence of characters with no fixed length, no inherent numeric meaning, and no obvious alignment between examples. Before any statistical learner can operate on text, that sequence must be transformed into a finite vector of features. This transformation is the subject of feature engineering for text, and despite the dominance of large neural language models, the classical pipeline remains indispensable. It is fast, interpretable, cheap to deploy, and frequently competitive on the tasks that actually fill production backlogs.

This chapter develops the classical text feature pipeline from first principles: tokenization, bag of words, n-grams, term weighting with TF-IDF, and the hashing trick for scale. It then builds a bridge to distributed representations, word and sentence embeddings, and closes with a sober account of when sparse classical features still outperform dense neural ones.

The pipeline is best understood as a sequence of maps from one representation to the next, each adding structure or shedding cost.

flowchart LR
    A["Raw text string"] --> B["Tokens"]
    B --> C["Normalized tokens"]
    C --> D["Document term matrix"]
    D --> E["TF-IDF weighted matrix"]
    C --> F["Hashed feature vector"]
    C --> G["Averaged word embedding"]
    A --> H["Sentence embedding"]
    E --> I["Linear classifier"]
    F --> I
    G --> I
    H --> I

The top path is the sparse symbolic pipeline that dominates Sections 1 through 5. The lower paths are the dense semantic routes of Section 6. Section 7 explains when each is the right choice.

68.1 1. From Characters to Tokens

68.1.1 1.1 The tokenization problem

A learner cannot consume raw bytes meaningfully, so the first step is to segment a string into discrete units called tokens. Tokenization sounds trivial in English, where whitespace approximates word boundaries, but it is full of decisions that materially affect downstream accuracy. Consider the string "Don't email me at jane.doe@acme.co!". Should Don't be one token, two (Do and n't), or three? Is jane.doe@acme.co a single semantic unit or five fragments? Each choice changes the feature space.

A simple regular expression tokenizer captures most word like spans:

import re
def tokenize(text):
    return re.findall(r"[A-Za-z0-9']+", text.lower())

This lowercases, splits on non alphanumeric characters, and keeps apostrophes. It is crude but reproducible, and reproducibility matters more than linguistic elegance because the same function must run at training and inference time. A subtle but consequential rule follows from this: the tokenizer and every normalization step are part of the model, not preprocessing performed once and discarded. Any divergence between the function applied to training data and the function applied at serving time is a silent source of degradation often called training serving skew, and the most reliable defense is to encapsulate the entire transform (in scikit-learn, a Pipeline) so that it is fit once and applied identically thereafter.

To fix vocabulary, define the type token distinction. A token is a single occurrence of a unit in a specific document, while a type is a distinct unit in the abstract vocabulary. The sentence the cat sat on the mat has six tokens but five types, since the occurs twice. The vocabulary is the set of types, and the dimension of the feature space is the number of types retained after pruning.

68.1.2 1.2 Normalization choices

Beyond segmentation, several normalization steps shrink the vocabulary and improve generalization. Lowercasing collapses The and the. Stemming reduces inflected forms to a root, so running, runs, and ran may all map toward run using a rule based algorithm such as the Porter stemmer. Lemmatization does the same but uses a dictionary and part of speech information to produce valid words, mapping better to good. Accent folding and Unicode normalization (NFKC) unify visually identical characters.

Each normalization trades precision for recall in the feature space. Aggressive stemming can merge university and universe, destroying a useful distinction. The right level of normalization is empirical and task dependent, and it should be tuned with cross validation rather than assumed.

68.1.3 1.3 Stop words and pruning

Function words such as the, of, and is occur in nearly every document and carry little discriminative signal for many tasks. Removing a stop word list reduces dimensionality and noise. However, stop word removal is not universally beneficial. For sentiment, the word not is decisive, and for authorship attribution function word frequencies are among the strongest signals. Modern practice often retains stop words and lets the weighting scheme downweight them, which is more robust than a hard list.

A related lever is frequency pruning. Tokens appearing in fewer than min_df documents are likely typos or noise, and tokens in more than max_df fraction of documents behave like stop words. Pruning both tails keeps the vocabulary compact and the model well conditioned. The shape of the count distribution makes this pruning especially effective. Word frequencies in natural language follow Zipf’s law: ranking types by frequency, the frequency of the type at rank $r$ is approximately

\[ f(r) \approx \frac{C}{r^{s}}, \qquad s \approx 1, \]

so a small number of types account for most tokens while a long tail of types each occur once or twice. The long tail inflates dimensionality without adding usable signal, which is exactly why a modest min_df (say 5) can discard the majority of types while removing only a small fraction of total token mass.

68.2 2. The Bag of Words Model

68.2.1 2.1 Vector space representation

The bag of words (BoW) model represents a document as the multiset of its tokens, discarding order entirely. Given a vocabulary $V = \{w_1, \dots, w_{|V|}\}$ extracted from the training corpus, a document $d$ becomes a vector $\mathbf{x}_d \in \mathbb{R}^{|V|}$ whose $j$th entry is the count of term $w_j$ in $d$:

\[ x_{d,j} = \text{tf}(w_j, d) = \sum_{t \in d} \mathbb{1}[t = w_j]. \]

Stacking all document vectors row wise yields the document term matrix $X \in \mathbb{R}^{n \times |V|}$. This matrix is the central object of classical text mining. It is extremely sparse, since any single document uses a tiny fraction of the global vocabulary, so it is stored in compressed sparse row format and the model never materializes the zeros.

68.2.2 2.2 Why discarding order can work

Throwing away word order seems reckless, yet BoW is a strong baseline for topic level tasks such as spam detection, document categorization, and many sentiment problems. The reason is that for these tasks the presence of certain words is nearly sufficient: an email containing viagra, wire transfer, and urgent is probably spam regardless of arrangement. The signal lives in the marginal token statistics, which BoW captures exactly. Tasks that hinge on syntax or compositional meaning, such as the food was good but the service was not, are where BoW begins to fail, and the next section partially repairs this.

The bag of words assumption has a precise probabilistic reading. The multinomial naive Bayes classifier models a document of class $c$ as a sequence of independent draws from a class conditional word distribution, giving

\[ P(d \mid c) = \frac{(\sum_j x_{d,j})!}{\prod_j x_{d,j}!} \prod_{j} P(w_j \mid c)^{x_{d,j}}. \]

The independence of draws is exactly the discarding of order, and the log likelihood is linear in the count vector $\mathbf{x}_d$. This is why a linear classifier over BoW features is not an arbitrary choice but the natural decision rule under the bag assumption: maximizing the posterior reduces to a weighted sum of token counts plus a bias, which is precisely what logistic regression and the linear SVM also compute, learned discriminatively rather than generatively.

68.2.3 2.3 Geometry and similarity

Once documents are vectors, similarity becomes geometry. The dominant measure is cosine similarity, which compares orientation rather than magnitude and so is robust to document length:

\[ \cos(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{y} \rVert}. \]

Cosine similarity on BoW or TF-IDF vectors underlies classical search ranking, nearest neighbor classification, and clustering. Its interpretability, every nonzero dimension corresponds to a literal word, is a property that dense embeddings sacrifice.

68.3 3. N-grams and Local Order

68.3.1 3.1 Capturing short phrases

A bigram is an ordered pair of adjacent tokens, a trigram an ordered triple, and in general an n-gram is a contiguous window of $n$ tokens. By adding n-grams to the vocabulary, the BoW model recovers a limited but valuable amount of word order. The phrase not good becomes a single feature distinct from good, allowing a linear model to assign it negative weight. For the sentence New York is not Los Angeles, the bigrams new york and los angeles capture multiword entities that unigrams shatter.

68.3.2 3.2 The dimensionality cost

N-grams are powerful but expensive. If the unigram vocabulary has size $|V|$, the space of possible bigrams is $|V|^2$, and trigrams $|V|^3$. In practice most n-grams never occur, so the realized count grows roughly linearly with corpus size rather than combinatorially, but the feature matrix still explodes from tens of thousands of columns to millions. This drives memory pressure and overfitting risk, and it is the primary motivation for the hashing trick in Section 5.

A common compromise is to include unigrams and bigrams (ngram_range=(1,2)) with aggressive min_df pruning, which captures the highest value local order while keeping the vocabulary manageable.

68.3.3 3.3 Character n-grams

N-grams need not be over words. Character n-grams, contiguous spans of characters, are remarkably robust. The word engineering yields character 4-grams engi, ngin, gine, and so on. Because they share substrings, character n-grams handle misspellings, morphological variation, and out of vocabulary words gracefully, and they require no language specific tokenizer. They are the workhorse of language identification, authorship attribution, and noisy social media text, and they are the conceptual ancestor of subword tokenization in neural models.

68.4 4. Term Weighting with TF-IDF

68.4.1 4.1 The intuition

Raw counts overweight common words. A term appearing in every document discriminates nothing, while a rare term that appears in only a handful of documents is highly informative when it does. TF-IDF (term frequency times inverse document frequency) formalizes this by scaling each count by how surprising the term is across the corpus.

68.4.2 4.2 The formulas

The term frequency component measures local importance. A common variant uses sublinear scaling to dampen the effect of repetition:

\[ \text{tf}(t, d) = 1 + \log\big(\text{count}(t, d)\big) \quad \text{for } \text{count}(t,d) > 0. \]

The inverse document frequency component measures global rarity. Let $n$ be the number of documents and $\text{df}(t)$ the number containing term $t$. A standard smoothed form is

\[ \text{idf}(t) = \log\frac{1 + n}{1 + \text{df}(t)} + 1. \]

The TF-IDF weight is the product, and the resulting document vector is usually L2 normalized so that document length does not dominate:

\[ \text{tfidf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t), \qquad \hat{\mathbf{x}}_d = \frac{\mathbf{x}_d}{\lVert \mathbf{x}_d \rVert_2}. \]

The smoothing constants prevent division by zero and bound the IDF of terms absent from training. The logarithm in IDF reflects a diminishing returns view of rarity, consistent with an information theoretic reading where $\text{idf}(t)$ approximates the self information of observing term $t$ in a randomly drawn document. If $P(t)$ is the probability that a random document contains $t$, estimated by $\text{df}(t)/n$, then the unsmoothed IDF $\log\frac{n}{\text{df}(t)} = -\log P(t)$ is exactly the Shannon surprisal of the event that $t$ appears. A term that occurs everywhere carries near zero surprisal and is heavily downweighted, while a term confined to a few documents carries high surprisal and is amplified. This is the formal sense in which TF-IDF measures specificity, as Sparck Jones argued in the founding paper on the idea (reference 3).

68.4.3 4.3 A worked example

Take a toy corpus of four documents over the terms cat, dog, and the.

Document	cat	dog	the
$d_1$	2	0	3
$d_2$	0	1	2
$d_3$	1	1	1
$d_4$	0	0	4

Here $n = 4$. The document frequencies are $\text{df}(\text{cat}) = 2$, $\text{df}(\text{dog}) = 2$, and $\text{df}(\text{the}) = 4$. Using the smoothed form $\text{idf}(t) = \log\frac{1+n}{1+\text{df}(t)} + 1$ with natural logarithm,

\[ \text{idf}(\text{cat}) = \log\tfrac{5}{3} + 1 \approx 1.51, \quad \text{idf}(\text{dog}) = \log\tfrac{5}{3} + 1 \approx 1.51, \quad \text{idf}(\text{the}) = \log\tfrac{5}{5} + 1 = 1.00. \]

The term the, present in every document, receives the floor weight of 1, while the rarer cat and dog are boosted. For document $d_3$ with raw counts $(1, 1, 1)$ and sublinear term frequency $1 + \log(1) = 1$ for each, the unnormalized TF-IDF vector is $(1.51,\ 1.51,\ 1.00)$. L2 normalization divides by $\sqrt{1.51^2 + 1.51^2 + 1.00^2} \approx 2.34$, giving the final vector $(0.645,\ 0.645,\ 0.427)$. The common word the survives but is muted, and the two content words dominate the document’s direction in the vector space, which is exactly the behavior cosine similarity then exploits.

68.4.4 4.4 Why it remains a strong baseline

TF-IDF weighted vectors fed to a linear classifier such as logistic regression or a linear support vector machine form one of the most reliable baselines in applied natural language processing. The weighting is parameter free, the model is convex and trains in seconds on millions of documents, and the learned coefficients are directly inspectable. Before reaching for a transformer, a practitioner should establish this baseline, because surprisingly often it is within a point or two of far more expensive models, and it is the number every fancier method must beat.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
vec = TfidfVectorizer(ngram_range=(1,2), min_df=5, sublinear_tf=True)
X = vec.fit_transform(train_texts)
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)

68.5 5. The Hashing Trick

68.5.1 5.1 Vocabularies do not scale

The vectorizers above must build and store an explicit mapping from each term to a column index. For a streaming corpus or a vocabulary in the tens of millions, this dictionary becomes a memory and engineering bottleneck. It must be learned in a first pass, held in RAM, serialized, and kept consistent between training and serving. The hashing trick eliminates the dictionary entirely.

68.5.2 5.2 Mechanism

Feature hashing maps each token directly to a column index by applying a hash function $h$ and reducing modulo a fixed dimension $m$:

\[ j = h(t) \bmod m. \]

The token’s contribution is added to column $j$. To keep the expected inner product unbiased under collisions, a second hash $\xi(t) \in \{-1, +1\}$ supplies a sign, so the hashed feature is

\[ \phi_j(d) = \sum_{t \in d \,:\, h(t) \bmod m = j} \xi(t)\,\text{count}(t, d). \]

The sign hash is what makes this work. Without it, two colliding terms would always add constructively, systematically inflating the inner product between documents that happen to share a collision. With independent random signs, a collision contributes a zero mean perturbation, so the estimate stays unbiased: for any two documents $\mathbf{x}$ and $\mathbf{y}$, the hashed dot product satisfies $\mathbb{E}[\langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle] = \langle \mathbf{x}, \mathbf{y} \rangle$, where the expectation is over the random choice of hash functions (reference 4). The price of an unbiased estimate is variance. The error variance from collisions scales as roughly $\lVert \mathbf{x} \rVert^2 \lVert \mathbf{y} \rVert^2 / m$, so it falls linearly in the table size $m$. Choosing $m$ generously, for example $2^{20}$ or $2^{22}$, drives the collision noise far below the signal for typical sparse text vectors, since each document touches only a tiny fraction of the $m$ buckets.

68.5.3 5.3 Trade-offs

The hashing trick is stateless, single pass, fixed memory, and trivially parallel, which makes it ideal for online learning and very large scale pipelines. The cost is interpretability: two distinct terms can collide into the same column, and there is no inverse map from a column back to a word, so feature inspection and debugging become harder. IDF weighting is also awkward because document frequencies are not tracked, though approximate schemes exist. In practice the hashing trick is chosen when scale and statelessness dominate and interpretability is expendable.

68.6 6. Bridge to Embeddings

68.6.1 6.1 The fundamental limitation of sparse features

Every representation so far treats words as atomic and orthogonal. In the BoW space the vectors for physician and doctor are exactly as dissimilar as physician and banana, because each word is its own dimension with no shared structure. This is the curse of one hot encoding: there is no notion of semantic proximity, and a model that has learned about doctor transfers nothing to physician. Synonymy, polysemy, and graded similarity are all invisible.

68.6.2 6.2 Distributional semantics

The escape route is the distributional hypothesis: words that occur in similar contexts tend to have similar meanings. If we represent a word by the statistics of the contexts it appears in, then physician and doctor, which keep similar company, end up with similar representations. Word embeddings operationalize this by mapping each word to a dense, low dimensional vector $\mathbf{v}_w \in \mathbb{R}^k$ with $k$ typically between 50 and 300, learned so that the geometry of the space reflects co occurrence statistics.

Methods such as word2vec learn these vectors by training a shallow network to predict a word from its context (or vice versa), while GloVe factorizes a global co occurrence matrix. The famous result is that the resulting space supports analogical arithmetic, where

\[ \mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}} \approx \mathbf{v}_{\text{queen}}, \]

evidence that linear directions in the space encode interpretable semantic relations such as gender or pluralization. Crucially, these dense vectors are low dimensional and continuous, so a model can generalize across related words rather than memorizing each one.

68.6.3 6.3 From words to documents and sentences

Word vectors describe words, but most tasks need a representation of a whole document or sentence. The simplest bridge averages the word vectors in a text, optionally weighting each by its IDF so that informative words dominate:

\[ \mathbf{v}_d = \frac{1}{\sum_{t \in d} \text{idf}(t)} \sum_{t \in d} \text{idf}(t)\,\mathbf{v}_t. \]

This averaged embedding is a strong and cheap document feature, but it again discards order and cannot disambiguate words whose meaning depends on context. Contextual sentence embeddings, produced by transformer encoders such as those in the Sentence-BERT family, address both issues. They feed the entire sentence through a network so that each token’s vector depends on its neighbors, then pool into a single fixed length vector tuned so that semantically similar sentences land near each other under cosine similarity. The output is a dense embedding suitable for semantic search, clustering, retrieval, and as input features to a downstream classifier.

The conceptual arc is therefore continuous. BoW is a sparse, exact, order free count vector. TF-IDF reweights it. Averaged word embeddings replace sparse counts with dense semantics but keep the bag assumption. Sentence embeddings restore context and order through a learned encoder. Each step trades interpretability and cost for semantic generalization.

68.7 7. When Classical Features Still Win

68.7.1 7.1 The honest comparison

It is tempting to assume neural embeddings dominate everywhere, but the evidence is more nuanced, and several regimes favor the classical pipeline.

Linear models over TF-IDF n-gram features are exceptionally strong on topic and keyword driven tasks: spam filtering, language identification, document routing, tagging, and many sentiment and intent classification problems. When the decisive signal is the presence of specific words or phrases, sparse features capture it directly and a linear model exploits it with no risk of underfitting the lexical cues. On standard topic classification benchmarks, a well tuned linear TF-IDF model is frequently within a small margin of a fine tuned transformer at a tiny fraction of the cost.

68.7.2 7.2 Cost, latency, and operations

Classical features win decisively on operational metrics. A TF-IDF plus linear model trains in seconds on a single CPU, serves predictions in microseconds, requires no GPU, has a memory footprint measured in megabytes, and produces coefficients an analyst can read to explain any decision. A transformer embedding pipeline demands accelerators, has higher per request latency, and is far harder to audit. In regulated domains where every decision must be explained, or in high throughput systems where latency and cost are binding constraints, the sparse linear model is often the responsible engineering choice.

68.7.3 7.3 Data, domain, and robustness

Embeddings shine when labeled data is scarce, because pretrained representations transfer knowledge from massive unlabeled corpora. But when labeled data is plentiful and in domain, the advantage narrows, since a linear model has enough signal to learn the task directly. Classical features also resist certain failure modes: they never hallucinate similarity, their behavior on a novel input is mechanically predictable, and they have no opaque dependence on pretraining data whose distribution may not match the deployment domain. For short, specialized, or highly technical vocabularies where general purpose embeddings were undertrained, character n-grams and TF-IDF can outperform off the shelf semantic vectors.

68.7.4 7.4 A practical decision rule

The mature workflow is not to choose ideologically but to layer. Always build the TF-IDF linear baseline first, because it is cheap and sets the bar. Reach for embeddings when the task requires semantic generalization beyond surface tokens, when synonymy and paraphrase matter, when labeled data is limited, or when the baseline plateaus below the required accuracy. Often the best production system concatenates both, sparse lexical features for precise keyword signal and dense embeddings for semantic coverage, giving the downstream model the strengths of each. Feature engineering for text did not end with the rise of embeddings; it absorbed them into a richer toolbox.

68.7.5 7.5 Common pitfalls

A handful of mistakes recur often enough to enumerate explicitly.

Fitting the vectorizer on the full dataset before splitting. The vocabulary and the IDF weights are learned parameters. Computing them over the test set leaks information and inflates reported accuracy. Fit the vectorizer on the training fold only, then transform the validation and test folds with the fitted state.
Treating tokenization as throwaway preprocessing. As noted in Section 1.1, any mismatch between training time and serving time normalization silently degrades quality. Serialize the whole pipeline, not just the model weights.
Over normalizing. Aggressive stemming, stop word removal, and case folding can erase the exact signal a task depends on, such as the negation in sentiment or function word frequencies in authorship attribution. Tune normalization with cross validation rather than applying it reflexively.
Unbounded n-gram ranges. Setting a wide ngram_range without min_df pruning produces a feature matrix with millions of columns, most occurring once, which invites overfitting and exhausts memory. Pair n-grams with frequency pruning.
Forgetting L2 normalization. Without it, long documents have larger vector norms and dominate distance and similarity computations purely because of length. Normalize unless you have a specific reason not to.
Expecting averaged embeddings to capture order. Mean pooling word vectors discards word order just as BoW does, so it cannot distinguish dog bites man from man bites dog. When order or compositional meaning is essential, use a contextual sentence encoder rather than an average.

68.8 References

Manning, C. D., Raghavan, P., and Schütze, H. Introduction to Information Retrieval. Cambridge University Press, 2008. https://nlp.stanford.edu/IR-book/
Jurafsky, D. and Martin, J. H. Speech and Language Processing, 3rd edition draft. 2024. https://web.stanford.edu/~jurafsky/slp3/
Sparck Jones, K. “A Statistical Interpretation of Term Specificity and Its Application in Retrieval.” Journal of Documentation, 1972. https://www.emerald.com/insight/content/doi/10.1108/eb026526/full/html
Weinberger, K., Dasgupta, A., Langford, J., Smola, A., and Attenberg, J. “Feature Hashing for Large Scale Multitask Learning.” ICML, 2009. https://arxiv.org/abs/0902.2206
Mikolov, T., Chen, K., Corrado, G., and Dean, J. “Efficient Estimation of Word Representations in Vector Space.” 2013. https://arxiv.org/abs/1301.3781
Pennington, J., Socher, R., and Manning, C. D. “GloVe: Global Vectors for Word Representation.” EMNLP, 2014. https://nlp.stanford.edu/projects/glove/
Reimers, N. and Gurevych, I. “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.” EMNLP, 2019. https://arxiv.org/abs/1908.10084
Wang, S. and Manning, C. D. “Baselines and Bigrams: Simple, Good Sentiment and Topic Classification.” ACL, 2012. https://aclanthology.org/P12-2018/
Pedregosa, F. et al. “Scikit-learn: Machine Learning in Python.” JMLR, 2011. https://jmlr.org/papers/v12/pedregosa11a.html
Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T. “Bag of Tricks for Efficient Text Classification.” EACL, 2017. https://arxiv.org/abs/1607.01759

# Feature Engineering for Text Text is the most abundant and least structured data modality in modern machine learning. A document is a sequence of characters with no fixed length, no inherent numeric meaning, and no obvious alignment between examples. Before any statistical learner can operate on text, that sequence must be transformed into a finite vector of features. This transformation is the subject of feature engineering for text, and despite the dominance of large neural language models, the classical pipeline remains indispensable. It is fast, interpretable, cheap to deploy, and frequently competitive on the tasks that actually fill production backlogs. This chapter develops the classical text feature pipeline from first principles: tokenization, bag of words, n-grams, term weighting with TF-IDF, and the hashing trick for scale. It then builds a bridge to distributed representations, word and sentence embeddings, and closes with a sober account of when sparse classical features still outperform dense neural ones. The pipeline is best understood as a sequence of maps from one representation to the next, each adding structure or shedding cost. ```{mermaid} flowchart LR A["Raw text string"] --> B["Tokens"] B --> C["Normalized tokens"] C --> D["Document term matrix"] D --> E["TF-IDF weighted matrix"] C --> F["Hashed feature vector"] C --> G["Averaged word embedding"] A --> H["Sentence embedding"] E --> I["Linear classifier"] F --> I G --> I H --> I ``` The top path is the sparse symbolic pipeline that dominates Sections 1 through 5. The lower paths are the dense semantic routes of Section 6. Section 7 explains when each is the right choice. ## 1. From Characters to Tokens ### 1.1 The tokenization problem A learner cannot consume raw bytes meaningfully, so the first step is to segment a string into discrete units called tokens. Tokenization sounds trivial in English, where whitespace approximates word boundaries, but it is full of decisions that materially affect downstream accuracy. Consider the string `"Don't email me at jane.doe@acme.co!"`. Should `Don't` be one token, two (`Do` and `n't`), or three? Is `jane.doe@acme.co` a single semantic unit or five fragments? Each choice changes the feature space. A simple regular expression tokenizer captures most word like spans: ```python import re def tokenize(text): return re.findall(r"[A-Za-z0-9']+", text.lower()) ``` This lowercases, splits on non alphanumeric characters, and keeps apostrophes. It is crude but reproducible, and reproducibility matters more than linguistic elegance because the same function must run at training and inference time. A subtle but consequential rule follows from this: the tokenizer and every normalization step are part of the model, not preprocessing performed once and discarded. Any divergence between the function applied to training data and the function applied at serving time is a silent source of degradation often called training serving skew, and the most reliable defense is to encapsulate the entire transform (in scikit-learn, a `Pipeline`) so that it is fit once and applied identically thereafter. To fix vocabulary, define the **type token distinction**. A token is a single occurrence of a unit in a specific document, while a type is a distinct unit in the abstract vocabulary. The sentence `the cat sat on the mat` has six tokens but five types, since `the` occurs twice. The vocabulary is the set of types, and the dimension of the feature space is the number of types retained after pruning. ### 1.2 Normalization choices Beyond segmentation, several normalization steps shrink the vocabulary and improve generalization. Lowercasing collapses `The` and `the`. Stemming reduces inflected forms to a root, so `running`, `runs`, and `ran` may all map toward `run` using a rule based algorithm such as the Porter stemmer. Lemmatization does the same but uses a dictionary and part of speech information to produce valid words, mapping `better` to `good`. Accent folding and Unicode normalization (NFKC) unify visually identical characters. Each normalization trades precision for recall in the feature space. Aggressive stemming can merge `university` and `universe`, destroying a useful distinction. The right level of normalization is empirical and task dependent, and it should be tuned with cross validation rather than assumed. ### 1.3 Stop words and pruning Function words such as `the`, `of`, and `is` occur in nearly every document and carry little discriminative signal for many tasks. Removing a stop word list reduces dimensionality and noise. However, stop word removal is not universally beneficial. For sentiment, the word `not` is decisive, and for authorship attribution function word frequencies are among the strongest signals. Modern practice often retains stop words and lets the weighting scheme downweight them, which is more robust than a hard list. A related lever is frequency pruning. Tokens appearing in fewer than `min_df` documents are likely typos or noise, and tokens in more than `max_df` fraction of documents behave like stop words. Pruning both tails keeps the vocabulary compact and the model well conditioned. The shape of the count distribution makes this pruning especially effective. Word frequencies in natural language follow Zipf's law: ranking types by frequency, the frequency of the type at rank $r$ is approximately $$ f(r) \approx \frac{C}{r^{s}}, \qquad s \approx 1, $$ so a small number of types account for most tokens while a long tail of types each occur once or twice. The long tail inflates dimensionality without adding usable signal, which is exactly why a modest `min_df` (say 5) can discard the majority of types while removing only a small fraction of total token mass. ## 2. The Bag of Words Model ### 2.1 Vector space representation The bag of words (BoW) model represents a document as the multiset of its tokens, discarding order entirely. Given a vocabulary $V = \{w_1, \dots, w_{|V|}\}$ extracted from the training corpus, a document $d$ becomes a vector $\mathbf{x}_d \in \mathbb{R}^{|V|}$ whose $j$th entry is the count of term $w_j$ in $d$: $$ x_{d,j} = \text{tf}(w_j, d) = \sum_{t \in d} \mathbb{1}[t = w_j]. $$ Stacking all document vectors row wise yields the document term matrix $X \in \mathbb{R}^{n \times |V|}$. This matrix is the central object of classical text mining. It is extremely sparse, since any single document uses a tiny fraction of the global vocabulary, so it is stored in compressed sparse row format and the model never materializes the zeros. ### 2.2 Why discarding order can work Throwing away word order seems reckless, yet BoW is a strong baseline for topic level tasks such as spam detection, document categorization, and many sentiment problems. The reason is that for these tasks the presence of certain words is nearly sufficient: an email containing `viagra`, `wire transfer`, and `urgent` is probably spam regardless of arrangement. The signal lives in the marginal token statistics, which BoW captures exactly. Tasks that hinge on syntax or compositional meaning, such as `the food was good but the service was not`, are where BoW begins to fail, and the next section partially repairs this. The bag of words assumption has a precise probabilistic reading. The multinomial naive Bayes classifier models a document of class $c$ as a sequence of independent draws from a class conditional word distribution, giving $$ P(d \mid c) = \frac{(\sum_j x_{d,j})!}{\prod_j x_{d,j}!} \prod_{j} P(w_j \mid c)^{x_{d,j}}. $$ The independence of draws is exactly the discarding of order, and the log likelihood is linear in the count vector $\mathbf{x}_d$. This is why a linear classifier over BoW features is not an arbitrary choice but the natural decision rule under the bag assumption: maximizing the posterior reduces to a weighted sum of token counts plus a bias, which is precisely what logistic regression and the linear SVM also compute, learned discriminatively rather than generatively. ### 2.3 Geometry and similarity Once documents are vectors, similarity becomes geometry. The dominant measure is cosine similarity, which compares orientation rather than magnitude and so is robust to document length: $$ \cos(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{y} \rVert}. $$ Cosine similarity on BoW or TF-IDF vectors underlies classical search ranking, nearest neighbor classification, and clustering. Its interpretability, every nonzero dimension corresponds to a literal word, is a property that dense embeddings sacrifice. ## 3. N-grams and Local Order ### 3.1 Capturing short phrases A bigram is an ordered pair of adjacent tokens, a trigram an ordered triple, and in general an n-gram is a contiguous window of $n$ tokens. By adding n-grams to the vocabulary, the BoW model recovers a limited but valuable amount of word order. The phrase `not good` becomes a single feature distinct from `good`, allowing a linear model to assign it negative weight. For the sentence `New York is not Los Angeles`, the bigrams `new york` and `los angeles` capture multiword entities that unigrams shatter. ### 3.2 The dimensionality cost N-grams are powerful but expensive. If the unigram vocabulary has size $|V|$, the space of possible bigrams is $|V|^2$, and trigrams $|V|^3$. In practice most n-grams never occur, so the realized count grows roughly linearly with corpus size rather than combinatorially, but the feature matrix still explodes from tens of thousands of columns to millions. This drives memory pressure and overfitting risk, and it is the primary motivation for the hashing trick in Section 5. A common compromise is to include unigrams and bigrams (`ngram_range=(1,2)`) with aggressive `min_df` pruning, which captures the highest value local order while keeping the vocabulary manageable. ### 3.3 Character n-grams N-grams need not be over words. Character n-grams, contiguous spans of characters, are remarkably robust. The word `engineering` yields character 4-grams `engi`, `ngin`, `gine`, and so on. Because they share substrings, character n-grams handle misspellings, morphological variation, and out of vocabulary words gracefully, and they require no language specific tokenizer. They are the workhorse of language identification, authorship attribution, and noisy social media text, and they are the conceptual ancestor of subword tokenization in neural models. ## 4. Term Weighting with TF-IDF ### 4.1 The intuition Raw counts overweight common words. A term appearing in every document discriminates nothing, while a rare term that appears in only a handful of documents is highly informative when it does. TF-IDF (term frequency times inverse document frequency) formalizes this by scaling each count by how surprising the term is across the corpus. ### 4.2 The formulas The term frequency component measures local importance. A common variant uses sublinear scaling to dampen the effect of repetition: $$ \text{tf}(t, d) = 1 + \log\big(\text{count}(t, d)\big) \quad \text{for } \text{count}(t,d) > 0. $$ The inverse document frequency component measures global rarity. Let $n$ be the number of documents and $\text{df}(t)$ the number containing term $t$. A standard smoothed form is $$ \text{idf}(t) = \log\frac{1 + n}{1 + \text{df}(t)} + 1. $$ The TF-IDF weight is the product, and the resulting document vector is usually L2 normalized so that document length does not dominate: $$ \text{tfidf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t), \qquad \hat{\mathbf{x}}_d = \frac{\mathbf{x}_d}{\lVert \mathbf{x}_d \rVert_2}. $$ The smoothing constants prevent division by zero and bound the IDF of terms absent from training. The logarithm in IDF reflects a diminishing returns view of rarity, consistent with an information theoretic reading where $\text{idf}(t)$ approximates the self information of observing term $t$ in a randomly drawn document. If $P(t)$ is the probability that a random document contains $t$, estimated by $\text{df}(t)/n$, then the unsmoothed IDF $\log\frac{n}{\text{df}(t)} = -\log P(t)$ is exactly the Shannon surprisal of the event that $t$ appears. A term that occurs everywhere carries near zero surprisal and is heavily downweighted, while a term confined to a few documents carries high surprisal and is amplified. This is the formal sense in which TF-IDF measures specificity, as Sparck Jones argued in the founding paper on the idea (reference 3). ### 4.3 A worked example Take a toy corpus of four documents over the terms `cat`, `dog`, and `the`. | Document | cat | dog | the | |----------|-----|-----|-----| | $d_1$ | 2 | 0 | 3 | | $d_2$ | 0 | 1 | 2 | | $d_3$ | 1 | 1 | 1 | | $d_4$ | 0 | 0 | 4 | Here $n = 4$. The document frequencies are $\text{df}(\text{cat}) = 2$, $\text{df}(\text{dog}) = 2$, and $\text{df}(\text{the}) = 4$. Using the smoothed form $\text{idf}(t) = \log\frac{1+n}{1+\text{df}(t)} + 1$ with natural logarithm, $$ \text{idf}(\text{cat}) = \log\tfrac{5}{3} + 1 \approx 1.51, \quad \text{idf}(\text{dog}) = \log\tfrac{5}{3} + 1 \approx 1.51, \quad \text{idf}(\text{the}) = \log\tfrac{5}{5} + 1 = 1.00. $$ The term `the`, present in every document, receives the floor weight of 1, while the rarer `cat` and `dog` are boosted. For document $d_3$ with raw counts $(1, 1, 1)$ and sublinear term frequency $1 + \log(1) = 1$ for each, the unnormalized TF-IDF vector is $(1.51,\ 1.51,\ 1.00)$. L2 normalization divides by $\sqrt{1.51^2 + 1.51^2 + 1.00^2} \approx 2.34$, giving the final vector $(0.645,\ 0.645,\ 0.427)$. The common word `the` survives but is muted, and the two content words dominate the document's direction in the vector space, which is exactly the behavior cosine similarity then exploits. ### 4.4 Why it remains a strong baseline TF-IDF weighted vectors fed to a linear classifier such as logistic regression or a linear support vector machine form one of the most reliable baselines in applied natural language processing. The weighting is parameter free, the model is convex and trains in seconds on millions of documents, and the learned coefficients are directly inspectable. Before reaching for a transformer, a practitioner should establish this baseline, because surprisingly often it is within a point or two of far more expensive models, and it is the number every fancier method must beat. ```python from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.linear_model import LogisticRegression vec = TfidfVectorizer(ngram_range=(1,2), min_df=5, sublinear_tf=True) X = vec.fit_transform(train_texts) clf = LogisticRegression(max_iter=1000).fit(X, train_labels) ``` ## 5. The Hashing Trick ### 5.1 Vocabularies do not scale The vectorizers above must build and store an explicit mapping from each term to a column index. For a streaming corpus or a vocabulary in the tens of millions, this dictionary becomes a memory and engineering bottleneck. It must be learned in a first pass, held in RAM, serialized, and kept consistent between training and serving. The hashing trick eliminates the dictionary entirely. ### 5.2 Mechanism Feature hashing maps each token directly to a column index by applying a hash function $h$ and reducing modulo a fixed dimension $m$: $$ j = h(t) \bmod m. $$ The token's contribution is added to column $j$. To keep the expected inner product unbiased under collisions, a second hash $\xi(t) \in \{-1, +1\}$ supplies a sign, so the hashed feature is $$ \phi_j(d) = \sum_{t \in d \,:\, h(t) \bmod m = j} \xi(t)\,\text{count}(t, d). $$ The sign hash is what makes this work. Without it, two colliding terms would always add constructively, systematically inflating the inner product between documents that happen to share a collision. With independent random signs, a collision contributes a zero mean perturbation, so the estimate stays unbiased: for any two documents $\mathbf{x}$ and $\mathbf{y}$, the hashed dot product satisfies $\mathbb{E}[\langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle] = \langle \mathbf{x}, \mathbf{y} \rangle$, where the expectation is over the random choice of hash functions (reference 4). The price of an unbiased estimate is variance. The error variance from collisions scales as roughly $\lVert \mathbf{x} \rVert^2 \lVert \mathbf{y} \rVert^2 / m$, so it falls linearly in the table size $m$. Choosing $m$ generously, for example $2^{20}$ or $2^{22}$, drives the collision noise far below the signal for typical sparse text vectors, since each document touches only a tiny fraction of the $m$ buckets. ### 5.3 Trade-offs The hashing trick is stateless, single pass, fixed memory, and trivially parallel, which makes it ideal for online learning and very large scale pipelines. The cost is interpretability: two distinct terms can collide into the same column, and there is no inverse map from a column back to a word, so feature inspection and debugging become harder. IDF weighting is also awkward because document frequencies are not tracked, though approximate schemes exist. In practice the hashing trick is chosen when scale and statelessness dominate and interpretability is expendable. ## 6. Bridge to Embeddings ### 6.1 The fundamental limitation of sparse features Every representation so far treats words as atomic and orthogonal. In the BoW space the vectors for `physician` and `doctor` are exactly as dissimilar as `physician` and `banana`, because each word is its own dimension with no shared structure. This is the curse of one hot encoding: there is no notion of semantic proximity, and a model that has learned about `doctor` transfers nothing to `physician`. Synonymy, polysemy, and graded similarity are all invisible. ### 6.2 Distributional semantics The escape route is the distributional hypothesis: words that occur in similar contexts tend to have similar meanings. If we represent a word by the statistics of the contexts it appears in, then `physician` and `doctor`, which keep similar company, end up with similar representations. Word embeddings operationalize this by mapping each word to a dense, low dimensional vector $\mathbf{v}_w \in \mathbb{R}^k$ with $k$ typically between 50 and 300, learned so that the geometry of the space reflects co occurrence statistics. Methods such as word2vec learn these vectors by training a shallow network to predict a word from its context (or vice versa), while GloVe factorizes a global co occurrence matrix. The famous result is that the resulting space supports analogical arithmetic, where $$ \mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}} \approx \mathbf{v}_{\text{queen}}, $$ evidence that linear directions in the space encode interpretable semantic relations such as gender or pluralization. Crucially, these dense vectors are low dimensional and continuous, so a model can generalize across related words rather than memorizing each one. ### 6.3 From words to documents and sentences Word vectors describe words, but most tasks need a representation of a whole document or sentence. The simplest bridge averages the word vectors in a text, optionally weighting each by its IDF so that informative words dominate: $$ \mathbf{v}_d = \frac{1}{\sum_{t \in d} \text{idf}(t)} \sum_{t \in d} \text{idf}(t)\,\mathbf{v}_t. $$ This averaged embedding is a strong and cheap document feature, but it again discards order and cannot disambiguate words whose meaning depends on context. Contextual sentence embeddings, produced by transformer encoders such as those in the Sentence-BERT family, address both issues. They feed the entire sentence through a network so that each token's vector depends on its neighbors, then pool into a single fixed length vector tuned so that semantically similar sentences land near each other under cosine similarity. The output is a dense embedding suitable for semantic search, clustering, retrieval, and as input features to a downstream classifier. The conceptual arc is therefore continuous. BoW is a sparse, exact, order free count vector. TF-IDF reweights it. Averaged word embeddings replace sparse counts with dense semantics but keep the bag assumption. Sentence embeddings restore context and order through a learned encoder. Each step trades interpretability and cost for semantic generalization. ## 7. When Classical Features Still Win ### 7.1 The honest comparison It is tempting to assume neural embeddings dominate everywhere, but the evidence is more nuanced, and several regimes favor the classical pipeline. Linear models over TF-IDF n-gram features are exceptionally strong on topic and keyword driven tasks: spam filtering, language identification, document routing, tagging, and many sentiment and intent classification problems. When the decisive signal is the presence of specific words or phrases, sparse features capture it directly and a linear model exploits it with no risk of underfitting the lexical cues. On standard topic classification benchmarks, a well tuned linear TF-IDF model is frequently within a small margin of a fine tuned transformer at a tiny fraction of the cost. ### 7.2 Cost, latency, and operations Classical features win decisively on operational metrics. A TF-IDF plus linear model trains in seconds on a single CPU, serves predictions in microseconds, requires no GPU, has a memory footprint measured in megabytes, and produces coefficients an analyst can read to explain any decision. A transformer embedding pipeline demands accelerators, has higher per request latency, and is far harder to audit. In regulated domains where every decision must be explained, or in high throughput systems where latency and cost are binding constraints, the sparse linear model is often the responsible engineering choice. ### 7.3 Data, domain, and robustness Embeddings shine when labeled data is scarce, because pretrained representations transfer knowledge from massive unlabeled corpora. But when labeled data is plentiful and in domain, the advantage narrows, since a linear model has enough signal to learn the task directly. Classical features also resist certain failure modes: they never hallucinate similarity, their behavior on a novel input is mechanically predictable, and they have no opaque dependence on pretraining data whose distribution may not match the deployment domain. For short, specialized, or highly technical vocabularies where general purpose embeddings were undertrained, character n-grams and TF-IDF can outperform off the shelf semantic vectors. ### 7.4 A practical decision rule The mature workflow is not to choose ideologically but to layer. Always build the TF-IDF linear baseline first, because it is cheap and sets the bar. Reach for embeddings when the task requires semantic generalization beyond surface tokens, when synonymy and paraphrase matter, when labeled data is limited, or when the baseline plateaus below the required accuracy. Often the best production system concatenates both, sparse lexical features for precise keyword signal and dense embeddings for semantic coverage, giving the downstream model the strengths of each. Feature engineering for text did not end with the rise of embeddings; it absorbed them into a richer toolbox. ### 7.5 Common pitfalls A handful of mistakes recur often enough to enumerate explicitly. - **Fitting the vectorizer on the full dataset before splitting.** The vocabulary and the IDF weights are learned parameters. Computing them over the test set leaks information and inflates reported accuracy. Fit the vectorizer on the training fold only, then transform the validation and test folds with the fitted state. - **Treating tokenization as throwaway preprocessing.** As noted in Section 1.1, any mismatch between training time and serving time normalization silently degrades quality. Serialize the whole pipeline, not just the model weights. - **Over normalizing.** Aggressive stemming, stop word removal, and case folding can erase the exact signal a task depends on, such as the negation in sentiment or function word frequencies in authorship attribution. Tune normalization with cross validation rather than applying it reflexively. - **Unbounded n-gram ranges.** Setting a wide `ngram_range` without `min_df` pruning produces a feature matrix with millions of columns, most occurring once, which invites overfitting and exhausts memory. Pair n-grams with frequency pruning. - **Forgetting L2 normalization.** Without it, long documents have larger vector norms and dominate distance and similarity computations purely because of length. Normalize unless you have a specific reason not to. - **Expecting averaged embeddings to capture order.** Mean pooling word vectors discards word order just as BoW does, so it cannot distinguish `dog bites man` from `man bites dog`. When order or compositional meaning is essential, use a contextual sentence encoder rather than an average. ## References 1. Manning, C. D., Raghavan, P., and Schütze, H. *Introduction to Information Retrieval*. Cambridge University Press, 2008. https://nlp.stanford.edu/IR-book/ 2. Jurafsky, D. and Martin, J. H. *Speech and Language Processing*, 3rd edition draft. 2024. https://web.stanford.edu/~jurafsky/slp3/ 3. Sparck Jones, K. "A Statistical Interpretation of Term Specificity and Its Application in Retrieval." *Journal of Documentation*, 1972. https://www.emerald.com/insight/content/doi/10.1108/eb026526/full/html 4. Weinberger, K., Dasgupta, A., Langford, J., Smola, A., and Attenberg, J. "Feature Hashing for Large Scale Multitask Learning." ICML, 2009. https://arxiv.org/abs/0902.2206 5. Mikolov, T., Chen, K., Corrado, G., and Dean, J. "Efficient Estimation of Word Representations in Vector Space." 2013. https://arxiv.org/abs/1301.3781 6. Pennington, J., Socher, R., and Manning, C. D. "GloVe: Global Vectors for Word Representation." EMNLP, 2014. https://nlp.stanford.edu/projects/glove/ 7. Reimers, N. and Gurevych, I. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." EMNLP, 2019. https://arxiv.org/abs/1908.10084 8. Wang, S. and Manning, C. D. "Baselines and Bigrams: Simple, Good Sentiment and Topic Classification." ACL, 2012. https://aclanthology.org/P12-2018/ 9. Pedregosa, F. et al. "Scikit-learn: Machine Learning in Python." *JMLR*, 2011. https://jmlr.org/papers/v12/pedregosa11a.html 10. Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T. "Bag of Tricks for Efficient Text Classification." EACL, 2017. https://arxiv.org/abs/1607.01759

Document	cat	dog	the
\(d_1\)	2	0	3
\(d_2\)	0	1	2
\(d_3\)	1	1	1
\(d_4\)	0	0	4