116 Linear Support Vector Machines

Support vector machines (SVMs) occupy a special place in the history of machine learning. They emerged from statistical learning theory in the 1990s and dominated practical classification for more than a decade before deep networks reclaimed the spotlight. Even today they remain a workhorse for tabular data, text classification, and any setting where the number of features is large relative to the number of samples. The linear SVM is the natural starting point: it captures the central idea of margin maximization in a form that is geometrically transparent, computationally tractable, and rich enough to motivate the kernel extensions that follow.

This chapter develops the linear SVM from first principles. We begin with the geometry of the maximum-margin hyperplane, formalize the hard-margin and soft-margin problems, derive the dual via Lagrangian duality, interpret the support vectors, and finally recast the entire method as empirical risk minimization with the hinge loss. By the end you should be able to read an SVM solver’s output, reason about its hyperparameters, and explain why the method generalizes well in high dimensions.

116.1 1. The Geometry of Separating Hyperplanes

116.1.1 1.1 Linear classifiers and decision boundaries

Consider a binary classification problem with training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, where $\mathbf{x}_i \in \mathbb{R}^d$ and labels $y_i \in \{-1, +1\}$. A linear classifier is defined by a weight vector $\mathbf{w} \in \mathbb{R}^d$ and a bias $b \in \mathbb{R}$, producing the decision function

\[ f(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b). \]

The set $\{\mathbf{x} : \mathbf{w}^\top \mathbf{x} + b = 0\}$ is a hyperplane that splits the input space into two half-spaces. The vector $\mathbf{w}$ is orthogonal to this hyperplane, and $b$ shifts it away from the origin. Any point classified positive lies on the side that $\mathbf{w}$ points toward.

The data are linearly separable if some $(\mathbf{w}, b)$ classifies every training point correctly, that is, $y_i(\mathbf{w}^\top \mathbf{x}_i + b) > 0$ for all $i$. When a separating hyperplane exists, infinitely many of them exist: you can tilt or translate the boundary slightly and still separate the classes. The perceptron algorithm finds one of these arbitrarily, with no preference among them. The SVM asks a sharper question. Of all separating hyperplanes, which one is best?

116.1.2 1.2 The margin

The SVM answer is the hyperplane that sits as far as possible from the nearest data points of either class. Intuitively, a boundary that clears the data by a wide buffer is more robust to noise and more likely to generalize, because small perturbations of a test point will not push it across the boundary.

The signed distance from a point $\mathbf{x}_i$ to the hyperplane is

\[ \frac{\mathbf{w}^\top \mathbf{x}_i + b}{\lVert \mathbf{w} \rVert}. \]

Multiplying by the label $y_i$ gives the functional margin scaled to a geometric quantity. The geometric margin of the point is $y_i(\mathbf{w}^\top \mathbf{x}_i + b) / \lVert \mathbf{w} \rVert$, which is positive exactly when the point is correctly classified. The margin of the classifier is the smallest such distance over the whole training set:

\[ \gamma = \min_{i} \frac{y_i(\mathbf{w}^\top \mathbf{x}_i + b)}{\lVert \mathbf{w} \rVert}. \]

The maximum-margin principle says: choose $(\mathbf{w}, b)$ to maximize $\gamma$. This single criterion turns an underdetermined problem with infinitely many solutions into a well-posed optimization with a unique answer (when the data are separable).

116.2 2. The Hard-Margin SVM

116.2.1 2.1 From margin to a clean optimization

Maximizing $\gamma$ directly is awkward because $(\mathbf{w}, b)$ and the margin can be rescaled together without changing the hyperplane. If we multiply both $\mathbf{w}$ and $b$ by any positive constant, the decision boundary is identical but the functional margin scales. We remove this redundancy by fixing the scale: require the closest points to satisfy $y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$. This is called the canonical form. Under this normalization the geometric margin becomes $1 / \lVert \mathbf{w} \rVert$, and the total width of the margin band, the distance between the two parallel supporting hyperplanes $\mathbf{w}^\top\mathbf{x}+b = \pm 1$, is $2 / \lVert \mathbf{w} \rVert$.

Maximizing $1/\lVert \mathbf{w} \rVert$ is the same as minimizing $\lVert \mathbf{w} \rVert$, and for analytic convenience we minimize $\tfrac{1}{2}\lVert \mathbf{w} \rVert^2$. The hard-margin SVM is therefore the constrained problem

\[ \min_{\mathbf{w}, b} \ \frac{1}{2}\lVert \mathbf{w} \rVert^2 \quad \text{subject to} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1, \quad i = 1, \dots, n. \]

This is a convex quadratic program: a quadratic objective with linear inequality constraints. Convexity guarantees that any local optimum is global, so there is no risk of getting stuck in a bad solution. The constraints encode that every point must lie on the correct side of its supporting hyperplane, no closer to the boundary than the margin allows.

116.2.2 2.2 Limitations of the hard margin

The hard-margin formulation has two serious weaknesses. First, it is feasible only when the data are perfectly linearly separable. A single mislabeled point or a touch of class overlap makes the constraint set empty and the problem has no solution. Second, even when separability holds, the solution can be dictated by outliers. One anomalous point near the boundary can dramatically narrow the margin and rotate the hyperplane, hurting generalization. Real data are rarely cleanly separable, so we need a formulation that tolerates violations. That is the soft margin.

116.3 3. The Soft-Margin SVM

116.3.1 3.1 Slack variables

To allow some points to violate the margin, we introduce a slack variable $\xi_i \ge 0$ for each training example and relax the constraint to

\[ y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 - \xi_i. \]

The slack $\xi_i$ measures how far point $i$ intrudes into or past the margin. If $\xi_i = 0$, the point satisfies the original hard constraint and lies outside the margin. If $0 < \xi_i \le 1$, the point is inside the margin band but still on the correct side of the decision boundary. If $\xi_i > 1$, the point is misclassified. We do not want slack to be free, so we penalize the total slack in the objective:

\[ \min_{\mathbf{w}, b, \boldsymbol{\xi}} \ \frac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^n \xi_i \quad \text{subject to} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0. \]

This remains a convex quadratic program. It always has a feasible solution because we can satisfy any constraint by making $\xi_i$ large enough, so non-separable data pose no problem.

116.3.2 3.2 The role of the regularization parameter C

The constant $C > 0$ controls the trade-off between two competing goals: a wide margin (small $\lVert \mathbf{w} \rVert$) and few margin violations (small $\sum \xi_i$). It is the single most important hyperparameter of the linear SVM.

A large $C$ heavily penalizes slack, pushing the optimizer to classify training points correctly even at the cost of a narrow margin. In the limit $C \to \infty$ the soft margin recovers the hard margin. This regime risks overfitting, because the boundary contorts to accommodate noisy points. A small $C$ tolerates more violations in exchange for a wider, smoother margin. This regularizes the model and often generalizes better, but if $C$ is too small the classifier underfits and may ignore real structure. In practice $C$ is chosen by cross validation, typically searching over a logarithmic grid such as $\{10^{-3}, 10^{-2}, \dots, 10^{3}\}$.

It helps to see $C$ as the inverse of a regularization strength. Rewriting the objective as $\sum_i \xi_i + \tfrac{1}{2C}\lVert \mathbf{w} \rVert^2$ shows that $1/C$ multiplies the regularizer. Large $C$ means weak regularization, small $C$ means strong regularization.

116.4 4. The Dual Formulation

116.4.1 4.1 The Lagrangian

The constrained problem above is the primal. Its dual, obtained through Lagrangian duality, is central to both the theory and the practical solution of SVMs. We attach a multiplier $\alpha_i \ge 0$ to each margin constraint and a multiplier $\mu_i \ge 0$ to each non-negativity constraint $\xi_i \ge 0$. The Lagrangian for the soft-margin problem is

\[ \mathcal{L}(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\mu}) = \frac{1}{2}\lVert \mathbf{w} \rVert^2 + C\sum_i \xi_i - \sum_i \alpha_i \big[ y_i(\mathbf{w}^\top \mathbf{x}_i + b) - 1 + \xi_i \big] - \sum_i \mu_i \xi_i. \]

Setting the partial derivatives of $\mathcal{L}$ with respect to the primal variables to zero gives the stationarity conditions:

\[ \frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 0 \ \Rightarrow \ \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i, \] \[ \frac{\partial \mathcal{L}}{\partial b} = 0 \ \Rightarrow \ \sum_i \alpha_i y_i = 0, \] \[ \frac{\partial \mathcal{L}}{\partial \xi_i} = 0 \ \Rightarrow \ C - \alpha_i - \mu_i = 0. \]

The first condition is remarkable. The optimal weight vector is a linear combination of the training inputs, weighted by $\alpha_i y_i$. The classifier lives entirely in the span of the data.

116.4.2 4.2 The dual problem

Substituting these conditions back into $\mathcal{L}$ eliminates $\mathbf{w}$, $b$, and $\boldsymbol{\xi}$. The condition $C - \alpha_i - \mu_i = 0$ together with $\mu_i \ge 0$ implies $0 \le \alpha_i \le C$, the so-called box constraint. The resulting dual problem is

\[ \max_{\boldsymbol{\alpha}} \ \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i^\top \mathbf{x}_j \] \[ \text{subject to} \quad \sum_{i=1}^n \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C. \]

This is again a convex quadratic program, now in the $n$ dual variables $\alpha_i$ rather than the $d$ primal variables of $\mathbf{w}$. Two features deserve emphasis. First, the data appear only through inner products $\mathbf{x}_i^\top \mathbf{x}_j$. This is the doorway to the kernel trick: replace the inner product with a kernel function $k(\mathbf{x}_i, \mathbf{x}_j)$ and the same machinery learns a nonlinear boundary without ever forming the high-dimensional feature map explicitly. Second, the dual has $n$ variables, so it is attractive when $d \gg n$, as in text or genomics, where the feature dimension dwarfs the sample size.

116.4.3 4.3 Strong duality and the KKT conditions

Because the primal is convex and the constraints are affine, Slater’s condition holds and strong duality applies: the optimal primal and dual objective values coincide, and solving the dual recovers the primal solution. At the optimum the Karush-Kuhn-Tucker (KKT) conditions hold. The complementary slackness conditions are the most informative:

\[ \alpha_i \big[ y_i(\mathbf{w}^\top \mathbf{x}_i + b) - 1 + \xi_i \big] = 0, \qquad \mu_i \xi_i = 0. \]

These equalities pin down the structure of the solution and lead directly to the notion of support vectors, which we examine next.

116.5 5. Support Vectors

116.5.1 5.1 Three categories of points

The complementary slackness conditions partition the training data into three groups according to the value of $\alpha_i$.

When $\alpha_i = 0$, the point sits strictly outside the margin and is correctly classified with room to spare. It contributes nothing to $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$. You could delete it from the training set and the solution would not change.

When $0 < \alpha_i < C$, then $\mu_i = C - \alpha_i > 0$, which forces $\xi_i = 0$. The point lies exactly on its margin boundary, satisfying $y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$. These are the free or margin support vectors.

When $\alpha_i = C$, then $\mu_i = 0$ and $\xi_i$ may be positive. The point lies inside the margin or is misclassified. These are the bound support vectors.

Points with $\alpha_i > 0$ are the support vectors. They alone determine the hyperplane. This sparsity is one of the SVM’s most appealing properties: the model is defined by a usually small subset of the training data, the examples that lie on or violate the margin, while the comfortably classified majority is irrelevant to the final boundary.

116.5.2 5.2 Recovering the bias and making predictions

The weight vector follows from $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$. To recover $b$, use any free support vector, an index $i$ with $0 < \alpha_i < C$, for which $y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$ holds exactly. Solving gives $b = y_i - \mathbf{w}^\top \mathbf{x}_i$. For numerical stability one averages this estimate over all free support vectors.

A new point is classified by

\[ f(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{i \in \text{SV}} \alpha_i y_i \, \mathbf{x}_i^\top \mathbf{x} + b \right), \]

where the sum runs only over support vectors. For a linear SVM it is cheaper to precompute $\mathbf{w}$ and evaluate $\mathbf{w}^\top \mathbf{x} + b$ directly, an $O(d)$ operation per prediction. The dual form of the prediction matters most when a kernel replaces the inner product.

Code

# Fit a linear SVM and visualize the max-margin hyperplane,
# its two supporting hyperplanes, and the support vectors.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.05, random_state=42)

clf = SVC(kernel="linear", C=10.0)
clf.fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]
print(f"weight vector w = {np.round(w, 3)}")
print(f"bias b = {b:.3f}")
print(f"margin width 2/||w|| = {2.0 / np.linalg.norm(w):.3f}")
print(f"number of support vectors = {len(clf.support_)}")

fig, ax = plt.subplots(figsize=(6, 5))
ax.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", s=30, edgecolors="k")

xx = np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200)
yy = np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200)
XX, YY = np.meshgrid(xx, yy)
Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()]).reshape(XX.shape)
ax.contour(XX, YY, Z, levels=[-1, 0, 1], colors="k",
           linestyles=["--", "-", "--"], linewidths=[1, 2, 1])
ax.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
           s=160, facecolors="none", edgecolors="lime", linewidths=2,
           label="support vectors")
ax.set_title("Max-margin hyperplane and support vectors")
ax.legend(loc="best")
fig.tight_layout()
plt.show()

weight vector w = [ 0.243 -0.239]
bias b = 1.220
margin width 2/||w|| = 5.869
number of support vectors = 2

For the linear case it is cheapest to precompute the weight vector $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ once and evaluate $\mathbf{w}^\top \mathbf{x} + b$ directly, rather than summing over support vectors at each prediction.

116.6 6. The Hinge Loss View

116.6.1 6.1 Unconstrained reformulation

The soft-margin SVM can be rewritten as an unconstrained optimization that exposes its connection to the broader family of regularized risk minimizers. At the optimum, the slack satisfies $\xi_i = \max(0,\, 1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b))$, because the optimizer drives each $\xi_i$ to the smallest value its constraints allow. Substituting this into the objective gives

\[ \min_{\mathbf{w}, b} \ \frac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^n \max\!\big(0,\, 1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)\big). \]

The function $\ell(m) = \max(0, 1 - m)$, where $m = y_i(\mathbf{w}^\top \mathbf{x}_i + b)$ is the functional margin of point $i$, is the hinge loss. The SVM is therefore regularized empirical risk minimization: minimize the average hinge loss plus an $\ell_2$ penalty on the weights. Dividing through by $Cn$ recovers the more familiar form with regularization parameter $\lambda = 1/(Cn)$.

116.6.2 6.2 Properties of the hinge loss

The hinge loss is zero whenever $m \ge 1$, meaning a point classified correctly and outside the margin incurs no penalty. For $m < 1$ the loss grows linearly in the margin deficit. This piecewise-linear shape has two important consequences.

First, the flat region at $m \ge 1$ is what produces sparse support vectors: only points with $m \le 1$ have nonzero loss and therefore influence the gradient and the solution. Second, the hinge is a convex upper bound on the zero-one classification loss, which counts the number of mistakes. Minimizing the hinge is a tractable convex surrogate for the intractable goal of minimizing misclassifications. Unlike the logistic loss, the hinge is exactly zero past the margin rather than merely small, which yields the sparsity and the maximum-margin geometry.

The hinge is not differentiable at $m = 1$, but it is convex, so subgradient methods apply. A subgradient of the per-example loss with respect to $\mathbf{w}$ is

\[ \nabla_{\mathbf{w}} \ell_i = \begin{cases} -\,y_i \mathbf{x}_i & \text{if } y_i(\mathbf{w}^\top \mathbf{x}_i + b) < 1, \\ \mathbf{0} & \text{otherwise.} \end{cases} \]

116.6.3 6.3 Solving by subgradient descent

The hinge-loss view motivates a family of large-scale solvers that bypass the quadratic program entirely. Stochastic subgradient descent, exemplified by the Pegasos algorithm, samples one example at a time and takes a step that combines the regularization gradient with the hinge subgradient.

Code

# Pegasos: stochastic subgradient descent for a linear SVM.
# The bias is folded in as an extra constant feature.
import numpy as np
from sklearn.datasets import make_blobs

rng = np.random.RandomState(0)
X, y01 = make_blobs(n_samples=200, centers=2, cluster_std=1.2, random_state=0)
y = np.where(y01 == 0, -1.0, 1.0)
X = np.hstack([X, np.ones((X.shape[0], 1))])  # bias as extra feature

n, d = X.shape
lam = 0.01            # regularization strength
T = 5000              # number of iterations
w = np.zeros(d)
for t in range(1, T + 1):
    i = rng.randint(n)
    eta = 1.0 / (lam * t)            # decaying step size
    margin = y[i] * (w @ X[i])
    if margin < 1:
        w = (1 - eta * lam) * w + eta * y[i] * X[i]
    else:
        w = (1 - eta * lam) * w

acc = np.mean(np.sign(X @ w) == y)
print(f"trained weight (incl. bias) = {np.round(w, 3)}")
print(f"training accuracy = {acc:.3f}")

trained weight (incl. bias) = [ 0.371 -1.08   2.24 ]
training accuracy = 0.930

Each update is $O(d)$ and touches a single example, so the method scales to millions of samples and very high dimensions. Coordinate ascent on the dual and specialized decomposition methods such as Sequential Minimal Optimization (SMO) are the other dominant solvers; SMO is the engine inside LIBSVM, while linear solvers such as LIBLINEAR and stochastic subgradient methods dominate for large linear problems.

116.7 7. Practical Considerations

116.7.1 7.1 Feature scaling and preprocessing

Because the SVM objective penalizes $\lVert \mathbf{w} \rVert^2$ and measures margins through inner products, it is sensitive to the scale of the features. A feature measured in large units will receive a disproportionately small weight and can dominate the geometry. Standardizing each feature to zero mean and unit variance, or scaling to a fixed range, is essential before training. Always fit the scaler on the training set and apply the same transformation to validation and test data to avoid leakage.

116.7.2 7.2 Choosing C and assessing the model

Select $C$ by cross validation over a logarithmic grid. Inspect the number of support vectors as a diagnostic: a model where almost every training point is a support vector is usually overfit or operating with too small a $C$, while a very sparse solution suggests an easy, well-separated problem. For imbalanced classes, most implementations expose per-class weights that effectively use a different $C$ for each class, which prevents the majority class from swamping the margin.

116.7.3 7.3 When to reach for a linear SVM

The linear SVM shines on high-dimensional sparse data, particularly bag-of-words text representations, where it is fast to train, memory efficient, and competitive with far more complex models. It provides a calibrated sense of margin and a sparse, interpretable model through the support vectors. When the decision boundary is genuinely nonlinear and the feature count is moderate, a kernel SVM extends the same dual machinery, which is the subject of the next chapter.

116.8 References

Cortes, C., and Vapnik, V. “Support-Vector Networks.” Machine Learning, 1995. https://link.springer.com/article/10.1007/BF00994018
Vapnik, V. The Nature of Statistical Learning Theory. Springer, 1995. https://link.springer.com/book/10.1007/978-1-4757-3264-1
Boser, B., Guyon, I., and Vapnik, V. “A Training Algorithm for Optimal Margin Classifiers.” COLT, 1992. https://dl.acm.org/doi/10.1145/130385.130401
Platt, J. “Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines.” Microsoft Research Technical Report, 1998. https://www.microsoft.com/en-us/research/publication/sequential-minimal-optimization-a-fast-algorithm-for-training-support-vector-machines/
Shalev-Shwartz, S., Singer, Y., Srebro, N., and Cotter, A. “Pegasos: Primal Estimated sub-GrAdient SOlver for SVM.” Mathematical Programming, 2011. https://link.springer.com/article/10.1007/s10107-010-0420-4
Fan, R., Chang, K., Hsieh, C., Wang, X., and Lin, C. “LIBLINEAR: A Library for Large Linear Classification.” Journal of Machine Learning Research, 2008. https://www.jmlr.org/papers/v9/fan08a.html
Boyd, S., and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004. https://web.stanford.edu/~boyd/cvxbook/
Bishop, C. Pattern Recognition and Machine Learning, Chapter 7. Springer, 2006. https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/

# Linear Support Vector Machines Support vector machines (SVMs) occupy a special place in the history of machine learning. They emerged from statistical learning theory in the 1990s and dominated practical classification for more than a decade before deep networks reclaimed the spotlight. Even today they remain a workhorse for tabular data, text classification, and any setting where the number of features is large relative to the number of samples. The linear SVM is the natural starting point: it captures the central idea of margin maximization in a form that is geometrically transparent, computationally tractable, and rich enough to motivate the kernel extensions that follow. This chapter develops the linear SVM from first principles. We begin with the geometry of the maximum-margin hyperplane, formalize the hard-margin and soft-margin problems, derive the dual via Lagrangian duality, interpret the support vectors, and finally recast the entire method as empirical risk minimization with the hinge loss. By the end you should be able to read an SVM solver's output, reason about its hyperparameters, and explain why the method generalizes well in high dimensions. ## 1. The Geometry of Separating Hyperplanes ### 1.1 Linear classifiers and decision boundaries Consider a binary classification problem with training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, where $\mathbf{x}_i \in \mathbb{R}^d$ and labels $y_i \in \{-1, +1\}$. A linear classifier is defined by a weight vector $\mathbf{w} \in \mathbb{R}^d$ and a bias $b \in \mathbb{R}$, producing the decision function $$ f(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b). $$ The set $\{\mathbf{x} : \mathbf{w}^\top \mathbf{x} + b = 0\}$ is a hyperplane that splits the input space into two half-spaces. The vector $\mathbf{w}$ is orthogonal to this hyperplane, and $b$ shifts it away from the origin. Any point classified positive lies on the side that $\mathbf{w}$ points toward. The data are linearly separable if some $(\mathbf{w}, b)$ classifies every training point correctly, that is, $y_i(\mathbf{w}^\top \mathbf{x}_i + b) > 0$ for all $i$. When a separating hyperplane exists, infinitely many of them exist: you can tilt or translate the boundary slightly and still separate the classes. The perceptron algorithm finds one of these arbitrarily, with no preference among them. The SVM asks a sharper question. Of all separating hyperplanes, which one is best? ### 1.2 The margin The SVM answer is the hyperplane that sits as far as possible from the nearest data points of either class. Intuitively, a boundary that clears the data by a wide buffer is more robust to noise and more likely to generalize, because small perturbations of a test point will not push it across the boundary. The signed distance from a point $\mathbf{x}_i$ to the hyperplane is $$ \frac{\mathbf{w}^\top \mathbf{x}_i + b}{\lVert \mathbf{w} \rVert}. $$ Multiplying by the label $y_i$ gives the functional margin scaled to a geometric quantity. The geometric margin of the point is $y_i(\mathbf{w}^\top \mathbf{x}_i + b) / \lVert \mathbf{w} \rVert$, which is positive exactly when the point is correctly classified. The margin of the classifier is the smallest such distance over the whole training set: $$ \gamma = \min_{i} \frac{y_i(\mathbf{w}^\top \mathbf{x}_i + b)}{\lVert \mathbf{w} \rVert}. $$ The maximum-margin principle says: choose $(\mathbf{w}, b)$ to maximize $\gamma$. This single criterion turns an underdetermined problem with infinitely many solutions into a well-posed optimization with a unique answer (when the data are separable). ## 2. The Hard-Margin SVM ### 2.1 From margin to a clean optimization Maximizing $\gamma$ directly is awkward because $(\mathbf{w}, b)$ and the margin can be rescaled together without changing the hyperplane. If we multiply both $\mathbf{w}$ and $b$ by any positive constant, the decision boundary is identical but the functional margin scales. We remove this redundancy by fixing the scale: require the closest points to satisfy $y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$. This is called the canonical form. Under this normalization the geometric margin becomes $1 / \lVert \mathbf{w} \rVert$, and the total width of the margin band, the distance between the two parallel supporting hyperplanes $\mathbf{w}^\top\mathbf{x}+b = \pm 1$, is $2 / \lVert \mathbf{w} \rVert$. Maximizing $1/\lVert \mathbf{w} \rVert$ is the same as minimizing $\lVert \mathbf{w} \rVert$, and for analytic convenience we minimize $\tfrac{1}{2}\lVert \mathbf{w} \rVert^2$. The hard-margin SVM is therefore the constrained problem $$ \min_{\mathbf{w}, b} \ \frac{1}{2}\lVert \mathbf{w} \rVert^2 \quad \text{subject to} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1, \quad i = 1, \dots, n. $$ This is a convex quadratic program: a quadratic objective with linear inequality constraints. Convexity guarantees that any local optimum is global, so there is no risk of getting stuck in a bad solution. The constraints encode that every point must lie on the correct side of its supporting hyperplane, no closer to the boundary than the margin allows. ### 2.2 Limitations of the hard margin The hard-margin formulation has two serious weaknesses. First, it is feasible only when the data are perfectly linearly separable. A single mislabeled point or a touch of class overlap makes the constraint set empty and the problem has no solution. Second, even when separability holds, the solution can be dictated by outliers. One anomalous point near the boundary can dramatically narrow the margin and rotate the hyperplane, hurting generalization. Real data are rarely cleanly separable, so we need a formulation that tolerates violations. That is the soft margin. ## 3. The Soft-Margin SVM ### 3.1 Slack variables To allow some points to violate the margin, we introduce a slack variable $\xi_i \ge 0$ for each training example and relax the constraint to $$ y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 - \xi_i. $$ The slack $\xi_i$ measures how far point $i$ intrudes into or past the margin. If $\xi_i = 0$, the point satisfies the original hard constraint and lies outside the margin. If $0 < \xi_i \le 1$, the point is inside the margin band but still on the correct side of the decision boundary. If $\xi_i > 1$, the point is misclassified. We do not want slack to be free, so we penalize the total slack in the objective: $$ \min_{\mathbf{w}, b, \boldsymbol{\xi}} \ \frac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^n \xi_i \quad \text{subject to} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0. $$ This remains a convex quadratic program. It always has a feasible solution because we can satisfy any constraint by making $\xi_i$ large enough, so non-separable data pose no problem. ### 3.2 The role of the regularization parameter C The constant $C > 0$ controls the trade-off between two competing goals: a wide margin (small $\lVert \mathbf{w} \rVert$) and few margin violations (small $\sum \xi_i$). It is the single most important hyperparameter of the linear SVM. A large $C$ heavily penalizes slack, pushing the optimizer to classify training points correctly even at the cost of a narrow margin. In the limit $C \to \infty$ the soft margin recovers the hard margin. This regime risks overfitting, because the boundary contorts to accommodate noisy points. A small $C$ tolerates more violations in exchange for a wider, smoother margin. This regularizes the model and often generalizes better, but if $C$ is too small the classifier underfits and may ignore real structure. In practice $C$ is chosen by cross validation, typically searching over a logarithmic grid such as $\{10^{-3}, 10^{-2}, \dots, 10^{3}\}$. It helps to see $C$ as the inverse of a regularization strength. Rewriting the objective as $\sum_i \xi_i + \tfrac{1}{2C}\lVert \mathbf{w} \rVert^2$ shows that $1/C$ multiplies the regularizer. Large $C$ means weak regularization, small $C$ means strong regularization. ## 4. The Dual Formulation ### 4.1 The Lagrangian The constrained problem above is the primal. Its dual, obtained through Lagrangian duality, is central to both the theory and the practical solution of SVMs. We attach a multiplier $\alpha_i \ge 0$ to each margin constraint and a multiplier $\mu_i \ge 0$ to each non-negativity constraint $\xi_i \ge 0$. The Lagrangian for the soft-margin problem is $$ \mathcal{L}(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\mu}) = \frac{1}{2}\lVert \mathbf{w} \rVert^2 + C\sum_i \xi_i - \sum_i \alpha_i \big[ y_i(\mathbf{w}^\top \mathbf{x}_i + b) - 1 + \xi_i \big] - \sum_i \mu_i \xi_i. $$ Setting the partial derivatives of $\mathcal{L}$ with respect to the primal variables to zero gives the stationarity conditions: $$ \frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 0 \ \Rightarrow \ \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i, $$ $$ \frac{\partial \mathcal{L}}{\partial b} = 0 \ \Rightarrow \ \sum_i \alpha_i y_i = 0, $$ $$ \frac{\partial \mathcal{L}}{\partial \xi_i} = 0 \ \Rightarrow \ C - \alpha_i - \mu_i = 0. $$ The first condition is remarkable. The optimal weight vector is a linear combination of the training inputs, weighted by $\alpha_i y_i$. The classifier lives entirely in the span of the data. ### 4.2 The dual problem Substituting these conditions back into $\mathcal{L}$ eliminates $\mathbf{w}$, $b$, and $\boldsymbol{\xi}$. The condition $C - \alpha_i - \mu_i = 0$ together with $\mu_i \ge 0$ implies $0 \le \alpha_i \le C$, the so-called box constraint. The resulting dual problem is $$ \max_{\boldsymbol{\alpha}} \ \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i^\top \mathbf{x}_j $$ $$ \text{subject to} \quad \sum_{i=1}^n \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C. $$ This is again a convex quadratic program, now in the $n$ dual variables $\alpha_i$ rather than the $d$ primal variables of $\mathbf{w}$. Two features deserve emphasis. First, the data appear only through inner products $\mathbf{x}_i^\top \mathbf{x}_j$. This is the doorway to the kernel trick: replace the inner product with a kernel function $k(\mathbf{x}_i, \mathbf{x}_j)$ and the same machinery learns a nonlinear boundary without ever forming the high-dimensional feature map explicitly. Second, the dual has $n$ variables, so it is attractive when $d \gg n$, as in text or genomics, where the feature dimension dwarfs the sample size. ### 4.3 Strong duality and the KKT conditions Because the primal is convex and the constraints are affine, Slater's condition holds and strong duality applies: the optimal primal and dual objective values coincide, and solving the dual recovers the primal solution. At the optimum the Karush-Kuhn-Tucker (KKT) conditions hold. The complementary slackness conditions are the most informative: $$ \alpha_i \big[ y_i(\mathbf{w}^\top \mathbf{x}_i + b) - 1 + \xi_i \big] = 0, \qquad \mu_i \xi_i = 0. $$ These equalities pin down the structure of the solution and lead directly to the notion of support vectors, which we examine next. ## 5. Support Vectors ### 5.1 Three categories of points The complementary slackness conditions partition the training data into three groups according to the value of $\alpha_i$. When $\alpha_i = 0$, the point sits strictly outside the margin and is correctly classified with room to spare. It contributes nothing to $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$. You could delete it from the training set and the solution would not change. When $0 < \alpha_i < C$, then $\mu_i = C - \alpha_i > 0$, which forces $\xi_i = 0$. The point lies exactly on its margin boundary, satisfying $y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$. These are the free or margin support vectors. When $\alpha_i = C$, then $\mu_i = 0$ and $\xi_i$ may be positive. The point lies inside the margin or is misclassified. These are the bound support vectors. Points with $\alpha_i > 0$ are the support vectors. They alone determine the hyperplane. This sparsity is one of the SVM's most appealing properties: the model is defined by a usually small subset of the training data, the examples that lie on or violate the margin, while the comfortably classified majority is irrelevant to the final boundary. ### 5.2 Recovering the bias and making predictions The weight vector follows from $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$. To recover $b$, use any free support vector, an index $i$ with $0 < \alpha_i < C$, for which $y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$ holds exactly. Solving gives $b = y_i - \mathbf{w}^\top \mathbf{x}_i$. For numerical stability one averages this estimate over all free support vectors. A new point is classified by $$ f(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{i \in \text{SV}} \alpha_i y_i \, \mathbf{x}_i^\top \mathbf{x} + b \right), $$ where the sum runs only over support vectors. For a linear SVM it is cheaper to precompute $\mathbf{w}$ and evaluate $\mathbf{w}^\top \mathbf{x} + b$ directly, an $O(d)$ operation per prediction. The dual form of the prediction matters most when a kernel replaces the inner product. ```{python} # Fit a linear SVM and visualize the max-margin hyperplane, # its two supporting hyperplanes, and the support vectors. import numpy as np import matplotlib.pyplot as plt from sklearn.svm import SVC from sklearn.datasets import make_blobs X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.05, random_state=42) clf = SVC(kernel="linear", C=10.0) clf.fit(X, y) w = clf.coef_[0] b = clf.intercept_[0] print(f"weight vector w = {np.round(w, 3)}") print(f"bias b = {b:.3f}") print(f"margin width 2/||w|| = {2.0 / np.linalg.norm(w):.3f}") print(f"number of support vectors = {len(clf.support_)}") fig, ax = plt.subplots(figsize=(6, 5)) ax.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", s=30, edgecolors="k") xx = np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200) yy = np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200) XX, YY = np.meshgrid(xx, yy) Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()]).reshape(XX.shape) ax.contour(XX, YY, Z, levels=[-1, 0, 1], colors="k", linestyles=["--", "-", "--"], linewidths=[1, 2, 1]) ax.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=160, facecolors="none", edgecolors="lime", linewidths=2, label="support vectors") ax.set_title("Max-margin hyperplane and support vectors") ax.legend(loc="best") fig.tight_layout() plt.show() ``` For the linear case it is cheapest to precompute the weight vector $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ once and evaluate $\mathbf{w}^\top \mathbf{x} + b$ directly, rather than summing over support vectors at each prediction. ## 6. The Hinge Loss View ### 6.1 Unconstrained reformulation The soft-margin SVM can be rewritten as an unconstrained optimization that exposes its connection to the broader family of regularized risk minimizers. At the optimum, the slack satisfies $\xi_i = \max(0,\, 1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b))$, because the optimizer drives each $\xi_i$ to the smallest value its constraints allow. Substituting this into the objective gives $$ \min_{\mathbf{w}, b} \ \frac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^n \max\!\big(0,\, 1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)\big). $$ The function $\ell(m) = \max(0, 1 - m)$, where $m = y_i(\mathbf{w}^\top \mathbf{x}_i + b)$ is the functional margin of point $i$, is the hinge loss. The SVM is therefore regularized empirical risk minimization: minimize the average hinge loss plus an $\ell_2$ penalty on the weights. Dividing through by $Cn$ recovers the more familiar form with regularization parameter $\lambda = 1/(Cn)$. ### 6.2 Properties of the hinge loss The hinge loss is zero whenever $m \ge 1$, meaning a point classified correctly and outside the margin incurs no penalty. For $m < 1$ the loss grows linearly in the margin deficit. This piecewise-linear shape has two important consequences. First, the flat region at $m \ge 1$ is what produces sparse support vectors: only points with $m \le 1$ have nonzero loss and therefore influence the gradient and the solution. Second, the hinge is a convex upper bound on the zero-one classification loss, which counts the number of mistakes. Minimizing the hinge is a tractable convex surrogate for the intractable goal of minimizing misclassifications. Unlike the logistic loss, the hinge is exactly zero past the margin rather than merely small, which yields the sparsity and the maximum-margin geometry. The hinge is not differentiable at $m = 1$, but it is convex, so subgradient methods apply. A subgradient of the per-example loss with respect to $\mathbf{w}$ is $$ \nabla_{\mathbf{w}} \ell_i = \begin{cases} -\,y_i \mathbf{x}_i & \text{if } y_i(\mathbf{w}^\top \mathbf{x}_i + b) < 1, \\ \mathbf{0} & \text{otherwise.} \end{cases} $$ ### 6.3 Solving by subgradient descent The hinge-loss view motivates a family of large-scale solvers that bypass the quadratic program entirely. Stochastic subgradient descent, exemplified by the Pegasos algorithm, samples one example at a time and takes a step that combines the regularization gradient with the hinge subgradient. ```{python} # Pegasos: stochastic subgradient descent for a linear SVM. # The bias is folded in as an extra constant feature. import numpy as np from sklearn.datasets import make_blobs rng = np.random.RandomState(0) X, y01 = make_blobs(n_samples=200, centers=2, cluster_std=1.2, random_state=0) y = np.where(y01 == 0, -1.0, 1.0) X = np.hstack([X, np.ones((X.shape[0], 1))]) # bias as extra feature n, d = X.shape lam = 0.01 # regularization strength T = 5000 # number of iterations w = np.zeros(d) for t in range(1, T + 1): i = rng.randint(n) eta = 1.0 / (lam * t) # decaying step size margin = y[i] * (w @ X[i]) if margin < 1: w = (1 - eta * lam) * w + eta * y[i] * X[i] else: w = (1 - eta * lam) * w acc = np.mean(np.sign(X @ w) == y) print(f"trained weight (incl. bias) = {np.round(w, 3)}") print(f"training accuracy = {acc:.3f}") ``` Each update is $O(d)$ and touches a single example, so the method scales to millions of samples and very high dimensions. Coordinate ascent on the dual and specialized decomposition methods such as Sequential Minimal Optimization (SMO) are the other dominant solvers; SMO is the engine inside LIBSVM, while linear solvers such as LIBLINEAR and stochastic subgradient methods dominate for large linear problems. ## 7. Practical Considerations ### 7.1 Feature scaling and preprocessing Because the SVM objective penalizes $\lVert \mathbf{w} \rVert^2$ and measures margins through inner products, it is sensitive to the scale of the features. A feature measured in large units will receive a disproportionately small weight and can dominate the geometry. Standardizing each feature to zero mean and unit variance, or scaling to a fixed range, is essential before training. Always fit the scaler on the training set and apply the same transformation to validation and test data to avoid leakage. ### 7.2 Choosing C and assessing the model Select $C$ by cross validation over a logarithmic grid. Inspect the number of support vectors as a diagnostic: a model where almost every training point is a support vector is usually overfit or operating with too small a $C$, while a very sparse solution suggests an easy, well-separated problem. For imbalanced classes, most implementations expose per-class weights that effectively use a different $C$ for each class, which prevents the majority class from swamping the margin. ### 7.3 When to reach for a linear SVM The linear SVM shines on high-dimensional sparse data, particularly bag-of-words text representations, where it is fast to train, memory efficient, and competitive with far more complex models. It provides a calibrated sense of margin and a sparse, interpretable model through the support vectors. When the decision boundary is genuinely nonlinear and the feature count is moderate, a kernel SVM extends the same dual machinery, which is the subject of the next chapter. ## References 1. Cortes, C., and Vapnik, V. "Support-Vector Networks." Machine Learning, 1995. https://link.springer.com/article/10.1007/BF00994018 2. Vapnik, V. The Nature of Statistical Learning Theory. Springer, 1995. https://link.springer.com/book/10.1007/978-1-4757-3264-1 3. Boser, B., Guyon, I., and Vapnik, V. "A Training Algorithm for Optimal Margin Classifiers." COLT, 1992. https://dl.acm.org/doi/10.1145/130385.130401 4. Platt, J. "Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines." Microsoft Research Technical Report, 1998. https://www.microsoft.com/en-us/research/publication/sequential-minimal-optimization-a-fast-algorithm-for-training-support-vector-machines/ 5. Shalev-Shwartz, S., Singer, Y., Srebro, N., and Cotter, A. "Pegasos: Primal Estimated sub-GrAdient SOlver for SVM." Mathematical Programming, 2011. https://link.springer.com/article/10.1007/s10107-010-0420-4 6. Fan, R., Chang, K., Hsieh, C., Wang, X., and Lin, C. "LIBLINEAR: A Library for Large Linear Classification." Journal of Machine Learning Research, 2008. https://www.jmlr.org/papers/v9/fan08a.html 7. Boyd, S., and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004. https://web.stanford.edu/~boyd/cvxbook/ 8. Bishop, C. Pattern Recognition and Machine Learning, Chapter 7. Springer, 2006. https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/