182 The Perceptron

The perceptron occupies a singular place in the history of machine learning. Introduced by Frank Rosenblatt in 1958, it was the first learning algorithm equipped with a rigorous convergence guarantee, and it furnished the conceptual template for nearly every supervised classifier that followed. This chapter develops the perceptron as a mathematical object, presents its learning rule, proves the convergence theorem for linearly separable data, examines the celebrated XOR limitation that stalled the field for a decade, and situates the model within the longer arc of neural computation.

182.1 1. The Model

182.1.1 1.1 Definition

A perceptron is a binary linear classifier. It maps an input vector $\mathbf{x} \in \mathbb{R}^d$ to a label in $\{-1, +1\}$ through a weighted sum followed by a threshold. Formally, given a weight vector $\mathbf{w} \in \mathbb{R}^d$ and a bias scalar $b \in \mathbb{R}$, the perceptron computes

\[ \hat{y} = \operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b), \]

where $\operatorname{sign}(z) = +1$ if $z \geq 0$ and $-1$ otherwise. The quantity $\mathbf{w}^\top \mathbf{x} + b$ is called the activation or the pre-activation, and the function $\operatorname{sign}(\cdot)$ is the activation function, sometimes called the Heaviside or threshold nonlinearity.

It is convenient to absorb the bias into the weight vector. Augment every input with a constant feature equal to one, so that $\tilde{\mathbf{x}} = (\mathbf{x}, 1) \in \mathbb{R}^{d+1}$ and $\tilde{\mathbf{w}} = (\mathbf{w}, b) \in \mathbb{R}^{d+1}$. Then $\mathbf{w}^\top \mathbf{x} + b = \tilde{\mathbf{w}}^\top \tilde{\mathbf{x}}$, and the classifier reduces to $\hat{y} = \operatorname{sign}(\tilde{\mathbf{w}}^\top \tilde{\mathbf{x}})$. For the remainder of this chapter we adopt this augmented convention and drop the tildes, writing $\mathbf{w}$ and $\mathbf{x}$ for the augmented vectors unless stated otherwise.

182.1.2 1.2 Geometric interpretation

The decision boundary of a perceptron is the set of points where the activation vanishes:

\[ \{\mathbf{x} : \mathbf{w}^\top \mathbf{x} = 0\}. \]

This is a hyperplane through the origin in the augmented space, or equivalently an affine hyperplane $\mathbf{w}^\top \mathbf{x} + b = 0$ in the original $d$ dimensional space. The vector $\mathbf{w}$ is the normal to this hyperplane and points toward the positive half space. The signed distance from a point $\mathbf{x}$ to the boundary is $(\mathbf{w}^\top \mathbf{x} + b) / \lVert \mathbf{w} \rVert$. Classification therefore amounts to asking which side of the hyperplane a point lies on. Because the boundary is linear, a perceptron can only express decision regions that are linearly separable, a constraint whose consequences occupy much of this chapter.

182.1.3 1.3 Relation to biological neurons

Rosenblatt conceived the perceptron as an idealized model of a neuron, building on the threshold logic unit of McCulloch and Pitts (1943). In that analogy the components of $\mathbf{x}$ are incoming signals from upstream cells, the weights $\mathbf{w}$ are synaptic efficacies, and the threshold models the firing condition of the cell body. The biological framing motivated the vocabulary that machine learning still uses, including weights, activation, and learning, even though the perceptron abstracts away almost all of the dynamics of real neurons.

182.2 2. The Perceptron Learning Algorithm

182.2.1 2.1 The update rule

Learning means choosing $\mathbf{w}$ from a training set $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ with $y_i \in \{-1, +1\}$. The perceptron does this online, processing one example at a time. The driving idea is mistake driven correction: leave the weights unchanged when a prediction is correct, and nudge them toward the correct answer when it is wrong.

A point $(\mathbf{x}_i, y_i)$ is misclassified by the current weights precisely when $y_i (\mathbf{w}^\top \mathbf{x}_i) \leq 0$, since a correct prediction makes the activation agree in sign with the label. On a mistake the algorithm applies

\[ \mathbf{w} \leftarrow \mathbf{w} + \eta \, y_i \, \mathbf{x}_i, \]

where $\eta > 0$ is the learning rate. The geometry is transparent. After the update, the new activation on $\mathbf{x}_i$ is

\[ y_i (\mathbf{w} + \eta y_i \mathbf{x}_i)^\top \mathbf{x}_i = y_i \mathbf{w}^\top \mathbf{x}_i + \eta \lVert \mathbf{x}_i \rVert^2, \]

which is strictly larger than the old activation $y_i \mathbf{w}^\top \mathbf{x}_i$. Each correction therefore moves the offending point in the right direction, though not necessarily far enough to fix it in one step. The learning rate $\eta$ only rescales the weights and the number of updates; with zero initialization it has no effect on which points are misclassified, so $\eta = 1$ is the usual choice.

182.2.2 2.2 The procedure

The full algorithm sweeps repeatedly through the data, applying corrections until a full pass produces no mistakes. The implementation below is a direct from-scratch translation using numpy, with the bias absorbed into the augmented weight vector as described above.

Code

import numpy as np

def perceptron_fit(X, y, max_epochs=100):
    # Augment each input with a constant feature of one for the bias
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xa.shape[1])
    mistakes_per_epoch = []
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(Xa.shape[0]):
            if y[i] * (w @ Xa[i]) <= 0:        # misclassified
                w = w + y[i] * Xa[i]           # mistake driven update
                mistakes += 1
        mistakes_per_epoch.append(mistakes)
        if mistakes == 0:                      # a clean pass, so halt
            break
    return w, mistakes_per_epoch

If the data are linearly separable, this loop terminates, a fact established in the next section. If they are not, the loop never halts on its own, and in practice one imposes a maximum number of epochs.

Running the algorithm on two well separated Gaussian clusters recovers a separating hyperplane in a handful of passes, and the learned weights $(w_1, w_2, b)$ define the boundary $w_1 x_1 + w_2 x_2 + b = 0$ drawn below.

Code

import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 60
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.7, size=(n, 2))
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.7, size=(n, 2))
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(n), -np.ones(n)])

w, history = perceptron_fit(X, y)
acc = np.mean(np.sign(np.hstack([X, np.ones((X.shape[0], 1))]) @ w) == y)
print(f"Converged in {len(history)} epochs, training accuracy {acc:.3f}")
print(f"Learned weights (w1, w2, bias): {w.round(3)}")

fig, ax = plt.subplots(figsize=(6, 5))
ax.scatter(X_pos[:, 0], X_pos[:, 1], c="tab:blue", label="+1", edgecolor="k")
ax.scatter(X_neg[:, 0], X_neg[:, 1], c="tab:red", label="-1", edgecolor="k")
xs = np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200)
ys = -(w[0] * xs + w[2]) / w[1]
ax.plot(xs, ys, "k-", lw=2, label="decision boundary")
ax.set_xlabel("x1"); ax.set_ylabel("x2")
ax.set_title("Perceptron decision boundary on separable data")
ax.legend()
plt.show()

Converged in 2 epochs, training accuracy 1.000
Learned weights (w1, w2, bias): [2.088 1.908 1.   ]

182.2.3 2.3 As stochastic gradient descent

The update can be read as stochastic gradient descent on the perceptron criterion

\[ L(\mathbf{w}) = \sum_{i : \, y_i \mathbf{w}^\top \mathbf{x}_i \leq 0} -\, y_i \, \mathbf{w}^\top \mathbf{x}_i, \]

which sums the magnitude of the activation over misclassified points and is zero when every point is correct. The gradient contributed by a single misclassified example is $-y_i \mathbf{x}_i$, so descending along $-\nabla L$ on that example reproduces $\mathbf{w} \leftarrow \mathbf{w} + \eta y_i \mathbf{x}_i$. This view connects the 1958 rule to the modern optimization framework, although the perceptron loss is piecewise linear and does not penalize the size of the margin, which distinguishes it from the hinge loss of the support vector machine.

182.3 3. The Convergence Theorem

182.3.1 3.1 Statement

The central theoretical result, proved by Block and by Novikoff in 1962, guarantees that the algorithm stops after finitely many mistakes whenever a separating hyperplane exists.

Theorem (Perceptron convergence). Suppose there exists a unit vector $\mathbf{w}^\star$ with $\lVert \mathbf{w}^\star \rVert = 1$ and a margin $\gamma > 0$ such that

\[ y_i \, (\mathbf{w}^\star)^\top \mathbf{x}_i \geq \gamma \quad \text{for all } i, \]

and suppose $\lVert \mathbf{x}_i \rVert \leq R$ for all $i$. Then the perceptron algorithm, started from $\mathbf{w} = \mathbf{0}$ with $\eta = 1$, makes at most

\[ \left( \frac{R}{\gamma} \right)^2 \]

mistakes before classifying every training point correctly.

The bound is striking. It does not depend on the dimension $d$ or on the number of examples $n$. It depends only on the ratio of the data radius $R$ to the separation margin $\gamma$. Easy problems, those with a wide margin, are solved quickly, while problems whose classes nearly touch may require many corrections.

182.3.2 3.2 Proof

Let $\mathbf{w}_k$ denote the weight vector after the $k$-th mistake, with $\mathbf{w}_0 = \mathbf{0}$. Suppose the $k$-th mistake occurs on example $(\mathbf{x}_i, y_i)$, so $\mathbf{w}_k = \mathbf{w}_{k-1} + y_i \mathbf{x}_i$. The proof tracks two quantities that the mistake bound squeezes together.

Lower bound on the projection onto $\mathbf{w}^\star$. Using the update and the margin condition,

\[ (\mathbf{w}^\star)^\top \mathbf{w}_k = (\mathbf{w}^\star)^\top \mathbf{w}_{k-1} + y_i (\mathbf{w}^\star)^\top \mathbf{x}_i \geq (\mathbf{w}^\star)^\top \mathbf{w}_{k-1} + \gamma. \]

By induction from $(\mathbf{w}^\star)^\top \mathbf{w}_0 = 0$,

\[ (\mathbf{w}^\star)^\top \mathbf{w}_k \geq k \gamma. \]

Upper bound on the norm. Because the $k$-th step corrected a mistake, $y_i \mathbf{w}_{k-1}^\top \mathbf{x}_i \leq 0$, so

\[ \lVert \mathbf{w}_k \rVert^2 = \lVert \mathbf{w}_{k-1} \rVert^2 + 2 y_i \mathbf{w}_{k-1}^\top \mathbf{x}_i + \lVert \mathbf{x}_i \rVert^2 \leq \lVert \mathbf{w}_{k-1} \rVert^2 + R^2. \]

By induction from $\lVert \mathbf{w}_0 \rVert^2 = 0$,

\[ \lVert \mathbf{w}_k \rVert^2 \leq k R^2. \]

Combining the bounds. The Cauchy-Schwarz inequality with $\lVert \mathbf{w}^\star \rVert = 1$ gives $(\mathbf{w}^\star)^\top \mathbf{w}_k \leq \lVert \mathbf{w}_k \rVert$. Chaining the two bounds,

\[ k \gamma \leq (\mathbf{w}^\star)^\top \mathbf{w}_k \leq \lVert \mathbf{w}_k \rVert \leq \sqrt{k} \, R. \]

Squaring the outer inequality $k \gamma \leq \sqrt{k} R$ yields $k^2 \gamma^2 \leq k R^2$, hence

\[ k \leq \left( \frac{R}{\gamma} \right)^2. \]

The number of mistakes is finite, so the algorithm halts. $\blacksquare$

182.3.3 3.3 Caveats

Three points deserve emphasis. First, the theorem bounds mistakes, not epochs, but since each epoch with at least one mistake performs at least one correction, the epoch count is also finite. Second, convergence says nothing about the quality of the resulting boundary beyond consistency on the training set; the perceptron finds some separator, not the maximum margin separator that a support vector machine would return. Third, separability is essential. When the data are not linearly separable, no $\mathbf{w}^\star$ with positive margin exists, the bound is vacuous, and the weights cycle indefinitely. The pocket algorithm of Gallant (1990) addresses this by retaining the best weights seen so far.

182.4 4. The XOR Limitation

182.4.1 4.1 A function the perceptron cannot learn

The exclusive or function takes two binary inputs and returns one when exactly one input is on. On the four corners of the unit square it produces

\[ \text{XOR}(0,0) = 0, \quad \text{XOR}(0,1) = 1, \quad \text{XOR}(1,0) = 1, \quad \text{XOR}(1,1) = 0. \]

The positive examples $(0,1)$ and $(1,0)$ sit on one diagonal, while the negative examples $(0,0)$ and $(1,1)$ sit on the other. No straight line separates the two diagonals, so XOR is not linearly separable.

A short contradiction makes this precise. Suppose weights $w_1, w_2$ and bias $b$ classified XOR correctly with threshold at zero. The positive cases require

\[ w_1 + b > 0 \quad \text{and} \quad w_2 + b > 0, \]

and the negative cases require

\[ b \leq 0 \quad \text{and} \quad w_1 + w_2 + b \leq 0. \]

Adding the two positive inequalities gives $w_1 + w_2 + 2b > 0$, while adding the two negative inequalities gives $w_1 + w_2 + 2b \leq 0$. These cannot both hold, so no single perceptron computes XOR.

We can watch the algorithm fail empirically. Running the same learning rule on the four XOR corners never produces a clean pass, no matter how many epochs we allow, because no separating hyperplane exists.

Code

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([-1, 1, 1, -1], dtype=float)

w_xor, history_xor = perceptron_fit(X_xor, y_xor, max_epochs=50)
converged = history_xor[-1] == 0
print(f"XOR converged within 50 epochs: {converged}")
print(f"Mistakes on the final epoch: {history_xor[-1]} of 4 points")

XOR converged within 50 epochs: False
Mistakes on the final epoch: 4 of 4 points

182.4.2 4.2 The Minsky and Papert critique

In their 1969 monograph Perceptrons, Marvin Minsky and Seymour Papert analyzed the representational limits of single layer perceptrons with mathematical care. XOR was the simplest emblem of a broader message: a perceptron whose features have bounded support cannot compute predicates such as connectedness or parity of an image, because these require evidence to be combined globally. The book was sometimes read, more pessimistically than its authors intended, as a verdict on neural networks in general, and it contributed to a sharp decline in funding and interest often called the first AI winter.

182.4.3 4.3 The resolution

The limitation is a property of a single linear layer, not of networked threshold units. XOR can be written as a composition,

\[ \text{XOR}(x_1, x_2) = \text{AND}\big(\text{OR}(x_1, x_2),\ \text{NAND}(x_1, x_2)\big), \]

each of whose constituents is linearly separable. A network with one hidden layer of two units followed by an output unit therefore computes XOR exactly. More generally the universal approximation results of the late 1980s established that a feedforward network with a single sufficiently wide hidden layer can approximate any continuous function on a compact domain. What was missing in 1969 was not expressive power but a practical algorithm to train the hidden weights, since the perceptron rule applies only to a single layer where the error is directly observable. That gap was closed when backpropagation was popularized by Rumelhart, Hinton, and Williams in 1986, giving a way to assign credit to hidden units and reviving the multilayer program.

182.5 5. Historical Importance

182.5.1 5.1 A foundational template

The perceptron established a pattern of thinking that still organizes the field. It framed learning as the adjustment of real valued parameters in response to errors, it tied a learning rule to a provable guarantee, and it exposed the role of separability and margin in classification. The convergence proof is the ancestor of the online learning and mistake bound analyses that mature in the work of Littlestone and others, and the margin $\gamma$ reappears at the center of support vector machine theory.

182.5.2 5.2 From a single unit to deep networks

Every neuron in a modern deep network is a perceptron with a smooth activation function in place of the hard threshold. Replacing $\operatorname{sign}$ with the logistic or rectified linear function makes the unit differentiable, which lets gradients flow through many stacked layers. Seen this way, contemporary architectures are vast compositions of perceptron like units, and the conceptual distance from Rosenblatt’s machine to a transformer is one of scale, differentiability, and composition rather than of fundamental kind.

182.5.3 5.3 Lessons

The perceptron story carries two durable lessons. The first is that a clean theoretical guarantee can coexist with a severe practical limitation, since convergence holds only under separability and the model cannot represent functions as simple as XOR. The second is that an apparent dead end can be a missing tool rather than a true barrier. The wall Minsky and Papert described was real for a single layer, yet it dissolved once depth and a credit assignment algorithm arrived. Both lessons remain relevant as the field continues to weigh what its models can represent against what its algorithms can actually learn.

182.6 References

Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65(6), 386-408. https://psycnet.apa.org/doi/10.1037/h0042519
McCulloch, W. S., and Pitts, W. (1943). A Logical Calculus of the Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics, 5, 115-133. https://link.springer.com/article/10.1007/BF02478259
Novikoff, A. B. (1962). On Convergence Proofs for Perceptrons. Proceedings of the Symposium on the Mathematical Theory of Automata, 12, 615-622. https://cs.uwaterloo.ca/~y328yu/classics/novikoff.pdf
Block, H. D. (1962). The Perceptron: A Model for Brain Functioning. Reviews of Modern Physics, 34(1), 123-135. https://journals.aps.org/rmp/abstract/10.1103/RevModPhys.34.123
Minsky, M., and Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press. https://mitpress.mit.edu/9780262630221/perceptrons/
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning Representations by Back-Propagating Errors. Nature, 323, 533-536. https://www.nature.com/articles/323533a0
Gallant, S. I. (1990). Perceptron-Based Learning Algorithms. IEEE Transactions on Neural Networks, 1(2), 179-191. https://ieeexplore.ieee.org/document/80230
Freund, Y., and Schapire, R. E. (1999). Large Margin Classification Using the Perceptron Algorithm. Machine Learning, 37(3), 277-296. https://link.springer.com/article/10.1023/A:1007662407062

# The Perceptron The perceptron occupies a singular place in the history of machine learning. Introduced by Frank Rosenblatt in 1958, it was the first learning algorithm equipped with a rigorous convergence guarantee, and it furnished the conceptual template for nearly every supervised classifier that followed. This chapter develops the perceptron as a mathematical object, presents its learning rule, proves the convergence theorem for linearly separable data, examines the celebrated XOR limitation that stalled the field for a decade, and situates the model within the longer arc of neural computation. ## 1. The Model ### 1.1 Definition A perceptron is a binary linear classifier. It maps an input vector $\mathbf{x} \in \mathbb{R}^d$ to a label in $\{-1, +1\}$ through a weighted sum followed by a threshold. Formally, given a weight vector $\mathbf{w} \in \mathbb{R}^d$ and a bias scalar $b \in \mathbb{R}$, the perceptron computes $$ \hat{y} = \operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b), $$ where $\operatorname{sign}(z) = +1$ if $z \geq 0$ and $-1$ otherwise. The quantity $\mathbf{w}^\top \mathbf{x} + b$ is called the activation or the pre-activation, and the function $\operatorname{sign}(\cdot)$ is the activation function, sometimes called the Heaviside or threshold nonlinearity. It is convenient to absorb the bias into the weight vector. Augment every input with a constant feature equal to one, so that $\tilde{\mathbf{x}} = (\mathbf{x}, 1) \in \mathbb{R}^{d+1}$ and $\tilde{\mathbf{w}} = (\mathbf{w}, b) \in \mathbb{R}^{d+1}$. Then $\mathbf{w}^\top \mathbf{x} + b = \tilde{\mathbf{w}}^\top \tilde{\mathbf{x}}$, and the classifier reduces to $\hat{y} = \operatorname{sign}(\tilde{\mathbf{w}}^\top \tilde{\mathbf{x}})$. For the remainder of this chapter we adopt this augmented convention and drop the tildes, writing $\mathbf{w}$ and $\mathbf{x}$ for the augmented vectors unless stated otherwise. ### 1.2 Geometric interpretation The decision boundary of a perceptron is the set of points where the activation vanishes: $$ \{\mathbf{x} : \mathbf{w}^\top \mathbf{x} = 0\}. $$ This is a hyperplane through the origin in the augmented space, or equivalently an affine hyperplane $\mathbf{w}^\top \mathbf{x} + b = 0$ in the original $d$ dimensional space. The vector $\mathbf{w}$ is the normal to this hyperplane and points toward the positive half space. The signed distance from a point $\mathbf{x}$ to the boundary is $(\mathbf{w}^\top \mathbf{x} + b) / \lVert \mathbf{w} \rVert$. Classification therefore amounts to asking which side of the hyperplane a point lies on. Because the boundary is linear, a perceptron can only express decision regions that are linearly separable, a constraint whose consequences occupy much of this chapter. ### 1.3 Relation to biological neurons Rosenblatt conceived the perceptron as an idealized model of a neuron, building on the threshold logic unit of McCulloch and Pitts (1943). In that analogy the components of $\mathbf{x}$ are incoming signals from upstream cells, the weights $\mathbf{w}$ are synaptic efficacies, and the threshold models the firing condition of the cell body. The biological framing motivated the vocabulary that machine learning still uses, including weights, activation, and learning, even though the perceptron abstracts away almost all of the dynamics of real neurons. ## 2. The Perceptron Learning Algorithm ### 2.1 The update rule Learning means choosing $\mathbf{w}$ from a training set $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ with $y_i \in \{-1, +1\}$. The perceptron does this online, processing one example at a time. The driving idea is mistake driven correction: leave the weights unchanged when a prediction is correct, and nudge them toward the correct answer when it is wrong. A point $(\mathbf{x}_i, y_i)$ is misclassified by the current weights precisely when $y_i (\mathbf{w}^\top \mathbf{x}_i) \leq 0$, since a correct prediction makes the activation agree in sign with the label. On a mistake the algorithm applies $$ \mathbf{w} \leftarrow \mathbf{w} + \eta \, y_i \, \mathbf{x}_i, $$ where $\eta > 0$ is the learning rate. The geometry is transparent. After the update, the new activation on $\mathbf{x}_i$ is $$ y_i (\mathbf{w} + \eta y_i \mathbf{x}_i)^\top \mathbf{x}_i = y_i \mathbf{w}^\top \mathbf{x}_i + \eta \lVert \mathbf{x}_i \rVert^2, $$ which is strictly larger than the old activation $y_i \mathbf{w}^\top \mathbf{x}_i$. Each correction therefore moves the offending point in the right direction, though not necessarily far enough to fix it in one step. The learning rate $\eta$ only rescales the weights and the number of updates; with zero initialization it has no effect on which points are misclassified, so $\eta = 1$ is the usual choice. ### 2.2 The procedure The full algorithm sweeps repeatedly through the data, applying corrections until a full pass produces no mistakes. The implementation below is a direct from-scratch translation using numpy, with the bias absorbed into the augmented weight vector as described above. ```{python} import numpy as np def perceptron_fit(X, y, max_epochs=100): # Augment each input with a constant feature of one for the bias Xa = np.hstack([X, np.ones((X.shape[0], 1))]) w = np.zeros(Xa.shape[1]) mistakes_per_epoch = [] for _ in range(max_epochs): mistakes = 0 for i in range(Xa.shape[0]): if y[i] * (w @ Xa[i]) <= 0: # misclassified w = w + y[i] * Xa[i] # mistake driven update mistakes += 1 mistakes_per_epoch.append(mistakes) if mistakes == 0: # a clean pass, so halt break return w, mistakes_per_epoch ``` If the data are linearly separable, this loop terminates, a fact established in the next section. If they are not, the loop never halts on its own, and in practice one imposes a maximum number of epochs. Running the algorithm on two well separated Gaussian clusters recovers a separating hyperplane in a handful of passes, and the learned weights $(w_1, w_2, b)$ define the boundary $w_1 x_1 + w_2 x_2 + b = 0$ drawn below. ```{python} import matplotlib.pyplot as plt rng = np.random.default_rng(0) n = 60 X_pos = rng.normal(loc=[2.0, 2.0], scale=0.7, size=(n, 2)) X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.7, size=(n, 2)) X = np.vstack([X_pos, X_neg]) y = np.hstack([np.ones(n), -np.ones(n)]) w, history = perceptron_fit(X, y) acc = np.mean(np.sign(np.hstack([X, np.ones((X.shape[0], 1))]) @ w) == y) print(f"Converged in {len(history)} epochs, training accuracy {acc:.3f}") print(f"Learned weights (w1, w2, bias): {w.round(3)}") fig, ax = plt.subplots(figsize=(6, 5)) ax.scatter(X_pos[:, 0], X_pos[:, 1], c="tab:blue", label="+1", edgecolor="k") ax.scatter(X_neg[:, 0], X_neg[:, 1], c="tab:red", label="-1", edgecolor="k") xs = np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200) ys = -(w[0] * xs + w[2]) / w[1] ax.plot(xs, ys, "k-", lw=2, label="decision boundary") ax.set_xlabel("x1"); ax.set_ylabel("x2") ax.set_title("Perceptron decision boundary on separable data") ax.legend() plt.show() ``` ### 2.3 As stochastic gradient descent The update can be read as stochastic gradient descent on the perceptron criterion $$ L(\mathbf{w}) = \sum_{i : \, y_i \mathbf{w}^\top \mathbf{x}_i \leq 0} -\, y_i \, \mathbf{w}^\top \mathbf{x}_i, $$ which sums the magnitude of the activation over misclassified points and is zero when every point is correct. The gradient contributed by a single misclassified example is $-y_i \mathbf{x}_i$, so descending along $-\nabla L$ on that example reproduces $\mathbf{w} \leftarrow \mathbf{w} + \eta y_i \mathbf{x}_i$. This view connects the 1958 rule to the modern optimization framework, although the perceptron loss is piecewise linear and does not penalize the size of the margin, which distinguishes it from the hinge loss of the support vector machine. ## 3. The Convergence Theorem ### 3.1 Statement The central theoretical result, proved by Block and by Novikoff in 1962, guarantees that the algorithm stops after finitely many mistakes whenever a separating hyperplane exists. **Theorem (Perceptron convergence).** Suppose there exists a unit vector $\mathbf{w}^\star$ with $\lVert \mathbf{w}^\star \rVert = 1$ and a margin $\gamma > 0$ such that $$ y_i \, (\mathbf{w}^\star)^\top \mathbf{x}_i \geq \gamma \quad \text{for all } i, $$ and suppose $\lVert \mathbf{x}_i \rVert \leq R$ for all $i$. Then the perceptron algorithm, started from $\mathbf{w} = \mathbf{0}$ with $\eta = 1$, makes at most $$ \left( \frac{R}{\gamma} \right)^2 $$ mistakes before classifying every training point correctly. The bound is striking. It does not depend on the dimension $d$ or on the number of examples $n$. It depends only on the ratio of the data radius $R$ to the separation margin $\gamma$. Easy problems, those with a wide margin, are solved quickly, while problems whose classes nearly touch may require many corrections. ### 3.2 Proof Let $\mathbf{w}_k$ denote the weight vector after the $k$-th mistake, with $\mathbf{w}_0 = \mathbf{0}$. Suppose the $k$-th mistake occurs on example $(\mathbf{x}_i, y_i)$, so $\mathbf{w}_k = \mathbf{w}_{k-1} + y_i \mathbf{x}_i$. The proof tracks two quantities that the mistake bound squeezes together. **Lower bound on the projection onto $\mathbf{w}^\star$.** Using the update and the margin condition, $$ (\mathbf{w}^\star)^\top \mathbf{w}_k = (\mathbf{w}^\star)^\top \mathbf{w}_{k-1} + y_i (\mathbf{w}^\star)^\top \mathbf{x}_i \geq (\mathbf{w}^\star)^\top \mathbf{w}_{k-1} + \gamma. $$ By induction from $(\mathbf{w}^\star)^\top \mathbf{w}_0 = 0$, $$ (\mathbf{w}^\star)^\top \mathbf{w}_k \geq k \gamma. $$ **Upper bound on the norm.** Because the $k$-th step corrected a mistake, $y_i \mathbf{w}_{k-1}^\top \mathbf{x}_i \leq 0$, so $$ \lVert \mathbf{w}_k \rVert^2 = \lVert \mathbf{w}_{k-1} \rVert^2 + 2 y_i \mathbf{w}_{k-1}^\top \mathbf{x}_i + \lVert \mathbf{x}_i \rVert^2 \leq \lVert \mathbf{w}_{k-1} \rVert^2 + R^2. $$ By induction from $\lVert \mathbf{w}_0 \rVert^2 = 0$, $$ \lVert \mathbf{w}_k \rVert^2 \leq k R^2. $$ **Combining the bounds.** The Cauchy-Schwarz inequality with $\lVert \mathbf{w}^\star \rVert = 1$ gives $(\mathbf{w}^\star)^\top \mathbf{w}_k \leq \lVert \mathbf{w}_k \rVert$. Chaining the two bounds, $$ k \gamma \leq (\mathbf{w}^\star)^\top \mathbf{w}_k \leq \lVert \mathbf{w}_k \rVert \leq \sqrt{k} \, R. $$ Squaring the outer inequality $k \gamma \leq \sqrt{k} R$ yields $k^2 \gamma^2 \leq k R^2$, hence $$ k \leq \left( \frac{R}{\gamma} \right)^2. $$ The number of mistakes is finite, so the algorithm halts. $\blacksquare$ ### 3.3 Caveats Three points deserve emphasis. First, the theorem bounds mistakes, not epochs, but since each epoch with at least one mistake performs at least one correction, the epoch count is also finite. Second, convergence says nothing about the quality of the resulting boundary beyond consistency on the training set; the perceptron finds some separator, not the maximum margin separator that a support vector machine would return. Third, separability is essential. When the data are not linearly separable, no $\mathbf{w}^\star$ with positive margin exists, the bound is vacuous, and the weights cycle indefinitely. The pocket algorithm of Gallant (1990) addresses this by retaining the best weights seen so far. ## 4. The XOR Limitation ### 4.1 A function the perceptron cannot learn The exclusive or function takes two binary inputs and returns one when exactly one input is on. On the four corners of the unit square it produces $$ \text{XOR}(0,0) = 0, \quad \text{XOR}(0,1) = 1, \quad \text{XOR}(1,0) = 1, \quad \text{XOR}(1,1) = 0. $$ The positive examples $(0,1)$ and $(1,0)$ sit on one diagonal, while the negative examples $(0,0)$ and $(1,1)$ sit on the other. No straight line separates the two diagonals, so XOR is not linearly separable. A short contradiction makes this precise. Suppose weights $w_1, w_2$ and bias $b$ classified XOR correctly with threshold at zero. The positive cases require $$ w_1 + b > 0 \quad \text{and} \quad w_2 + b > 0, $$ and the negative cases require $$ b \leq 0 \quad \text{and} \quad w_1 + w_2 + b \leq 0. $$ Adding the two positive inequalities gives $w_1 + w_2 + 2b > 0$, while adding the two negative inequalities gives $w_1 + w_2 + 2b \leq 0$. These cannot both hold, so no single perceptron computes XOR. We can watch the algorithm fail empirically. Running the same learning rule on the four XOR corners never produces a clean pass, no matter how many epochs we allow, because no separating hyperplane exists. ```{python} X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float) y_xor = np.array([-1, 1, 1, -1], dtype=float) w_xor, history_xor = perceptron_fit(X_xor, y_xor, max_epochs=50) converged = history_xor[-1] == 0 print(f"XOR converged within 50 epochs: {converged}") print(f"Mistakes on the final epoch: {history_xor[-1]} of 4 points") ``` ### 4.2 The Minsky and Papert critique In their 1969 monograph *Perceptrons*, Marvin Minsky and Seymour Papert analyzed the representational limits of single layer perceptrons with mathematical care. XOR was the simplest emblem of a broader message: a perceptron whose features have bounded support cannot compute predicates such as connectedness or parity of an image, because these require evidence to be combined globally. The book was sometimes read, more pessimistically than its authors intended, as a verdict on neural networks in general, and it contributed to a sharp decline in funding and interest often called the first AI winter. ### 4.3 The resolution The limitation is a property of a single linear layer, not of networked threshold units. XOR can be written as a composition, $$ \text{XOR}(x_1, x_2) = \text{AND}\big(\text{OR}(x_1, x_2),\ \text{NAND}(x_1, x_2)\big), $$ each of whose constituents is linearly separable. A network with one hidden layer of two units followed by an output unit therefore computes XOR exactly. More generally the universal approximation results of the late 1980s established that a feedforward network with a single sufficiently wide hidden layer can approximate any continuous function on a compact domain. What was missing in 1969 was not expressive power but a practical algorithm to train the hidden weights, since the perceptron rule applies only to a single layer where the error is directly observable. That gap was closed when backpropagation was popularized by Rumelhart, Hinton, and Williams in 1986, giving a way to assign credit to hidden units and reviving the multilayer program. ## 5. Historical Importance ### 5.1 A foundational template The perceptron established a pattern of thinking that still organizes the field. It framed learning as the adjustment of real valued parameters in response to errors, it tied a learning rule to a provable guarantee, and it exposed the role of separability and margin in classification. The convergence proof is the ancestor of the online learning and mistake bound analyses that mature in the work of Littlestone and others, and the margin $\gamma$ reappears at the center of support vector machine theory. ### 5.2 From a single unit to deep networks Every neuron in a modern deep network is a perceptron with a smooth activation function in place of the hard threshold. Replacing $\operatorname{sign}$ with the logistic or rectified linear function makes the unit differentiable, which lets gradients flow through many stacked layers. Seen this way, contemporary architectures are vast compositions of perceptron like units, and the conceptual distance from Rosenblatt's machine to a transformer is one of scale, differentiability, and composition rather than of fundamental kind. ### 5.3 Lessons The perceptron story carries two durable lessons. The first is that a clean theoretical guarantee can coexist with a severe practical limitation, since convergence holds only under separability and the model cannot represent functions as simple as XOR. The second is that an apparent dead end can be a missing tool rather than a true barrier. The wall Minsky and Papert described was real for a single layer, yet it dissolved once depth and a credit assignment algorithm arrived. Both lessons remain relevant as the field continues to weigh what its models can represent against what its algorithms can actually learn. ## References 1. Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. *Psychological Review*, 65(6), 386-408. https://psycnet.apa.org/doi/10.1037/h0042519 2. McCulloch, W. S., and Pitts, W. (1943). A Logical Calculus of the Ideas Immanent in Nervous Activity. *Bulletin of Mathematical Biophysics*, 5, 115-133. https://link.springer.com/article/10.1007/BF02478259 3. Novikoff, A. B. (1962). On Convergence Proofs for Perceptrons. *Proceedings of the Symposium on the Mathematical Theory of Automata*, 12, 615-622. https://cs.uwaterloo.ca/~y328yu/classics/novikoff.pdf 4. Block, H. D. (1962). The Perceptron: A Model for Brain Functioning. *Reviews of Modern Physics*, 34(1), 123-135. https://journals.aps.org/rmp/abstract/10.1103/RevModPhys.34.123 5. Minsky, M., and Papert, S. (1969). *Perceptrons: An Introduction to Computational Geometry*. MIT Press. https://mitpress.mit.edu/9780262630221/perceptrons/ 6. Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning Representations by Back-Propagating Errors. *Nature*, 323, 533-536. https://www.nature.com/articles/323533a0 7. Gallant, S. I. (1990). Perceptron-Based Learning Algorithms. *IEEE Transactions on Neural Networks*, 1(2), 179-191. https://ieeexplore.ieee.org/document/80230 8. Freund, Y., and Schapire, R. E. (1999). Large Margin Classification Using the Perceptron Algorithm. *Machine Learning*, 37(3), 277-296. https://link.springer.com/article/10.1023/A:1007662407062