184 The Universal Approximation Theorem

The Universal Approximation Theorem (UAT) is the foundational mathematical justification for using neural networks as general purpose function approximators. It tells us that under mild conditions a feedforward network can represent essentially any function we might care about. This chapter states the theorem carefully for both shallow and deep networks, sketches the intuition behind the classical proofs, contrasts the efficiency of width against depth, and clarifies the boundary between what the theorem guarantees and what it leaves silent.

184.1 1. Why the Theorem Matters

When we train a neural network we are searching inside a parametric family of functions for one that fits our data and generalizes. A natural first question is whether that family is rich enough to contain a good answer at all. If the hypothesis class were fundamentally limited, no optimizer and no amount of data could rescue it. The Universal Approximation Theorem answers this question of representational capacity in the affirmative. It separates three concerns that are easy to conflate:

Representability. Does a function close to the target exist in the network family?
Learnability. Can an optimization procedure find it from finite data?
Generalization. Does the found function behave well on unseen inputs?

The UAT speaks only to the first concern. Its value is that it removes representability as an excuse for failure, which sharpens our attention on optimization and generalization, the genuinely hard parts of deep learning.

184.2 2. The Shallow Network Statement

184.2.1 2.1 Setup and Notation

Consider a single hidden layer network mapping $\mathbb{R}^d$ to $\mathbb{R}$. With $N$ hidden units it computes

\[ f_N(x) = \sum_{i=1}^{N} c_i \, \sigma\!\left(w_i^\top x + b_i\right), \]

where $w_i \in \mathbb{R}^d$ are weight vectors, $b_i \in \mathbb{R}$ are biases, $c_i \in \mathbb{R}$ are output weights, and $\sigma$ is a fixed scalar nonlinearity applied componentwise. The set of all such $f_N$ over all $N$ and all parameter choices is the family of single hidden layer networks with activation $\sigma$.

184.2.2 2.2 Two Notions of Distance

Before stating the theorem we fix what “close” means, because the choice of norm changes both the statement and the activation conditions required. Let $K \subset \mathbb{R}^d$ be compact and let $C(K)$ be the space of continuous real functions on $K$.

Uniform (supremum) norm. $\|h\|_\infty = \sup_{x \in K} |h(x)|$. Closeness in this norm means the approximant tracks the target everywhere on $K$ at once, with no exceptions allowed. This is the strongest and most common mode for UAT statements.
$L^p$ norm. $\|h\|_{L^p(\mu)} = \left(\int_K |h(x)|^p \, d\mu(x)\right)^{1/p}$ for a finite measure $\mu$. Closeness here allows large pointwise errors on small sets, which is the natural mode for quantitative rates such as Barron’s bound in Section 5.

We say the network family is dense in a target space $V$ under a norm $\|\cdot\|$ if for every $g \in V$ and every $\varepsilon > 0$ there is a network $f$ with $\|f - g\| < \varepsilon$. Universality is exactly this density claim.

184.2.3 2.3 Cybenko and Hornik

The classical results come in two flavors. Cybenko (1989) proved the theorem for sigmoidal activations, meaning functions with $\sigma(t) \to 1$ as $t \to +\infty$ and $\sigma(t) \to 0$ as $t \to -\infty$. Hornik, Stinchcombe, and White (1989) gave a parallel argument and Hornik (1991) later isolated the essential property.

A clean modern statement is the following.

Theorem (Universal Approximation, shallow form). Let $\sigma : \mathbb{R} \to \mathbb{R}$ be continuous and not a polynomial. Let $K \subset \mathbb{R}^d$ be compact. Then for every continuous target $g \in C(K)$ and every tolerance $\varepsilon > 0$, there exist a width $N \in \mathbb{N}$, weight vectors $w_i \in \mathbb{R}^d$, biases $b_i \in \mathbb{R}$, and output weights $c_i \in \mathbb{R}$ such that the single hidden layer network $f_N(x) = \sum_{i=1}^N c_i \, \sigma(w_i^\top x + b_i)$ satisfies \[ \sup_{x \in K} \left| f_N(x) - g(x) \right| < \varepsilon. \]

In topological language, the set of single hidden layer networks is dense in $C(K)$ under the uniform norm. The non-polynomial condition (Leshno et al., 1993) is the sharp characterization. If $\sigma$ were a polynomial of degree $k$, every $f_N$ would also be a polynomial of degree at most $k$, since a sum of degree $k$ polynomials of affine arguments stays within that degree, and the family could never escape the finite dimensional space of degree $k$ polynomials. Such a family cannot approximate, say, $\cos(x)$ uniformly on a sufficiently long interval. Excluding polynomials is therefore both necessary and sufficient. Most activations used in practice satisfy the condition automatically: the logistic sigmoid, $\tanh$, ReLU, GELU, and the swish are all continuous and non-polynomial.

184.2.4 2.3 What Density Means in Practice

Density is an existence claim. It promises that the approximation error can be driven below any positive threshold, but it places no a priori bound on the width $N$ required. The required width can grow without limit as $\varepsilon \to 0$ or as the target function becomes more oscillatory. We return to this cost in Section 5.

184.3 3. The Intuition of the Proof

There are several routes to the shallow theorem. Two illuminate the mechanism especially well.

184.3.1 3.1 The Constructive Bump Argument

Work first in one dimension with a sigmoidal $\sigma$. A single unit $\sigma(w x + b)$ with large $w$ approximates a step function with a jump near $x = -b/w$. Subtracting two shifted steps,

\[ \sigma\!\left(w(x - a)\right) - \sigma\!\left(w(x - a')\right), \]

yields an approximate bump that is near one on the interval $[a, a']$ and near zero elsewhere. With enough bumps we build a piecewise constant approximation to any continuous $g$, and piecewise constant functions are dense in $C[a, a']$ by uniform continuity. Concretely, partition $[a, a']$ into $M$ subintervals of width $\delta = (a'-a)/M$. On each subinterval pick a representative value of $g$ and place a scaled bump there. Uniform continuity of $g$ guarantees that for any $\varepsilon$ we can choose $\delta$ small enough that $g$ varies by less than $\varepsilon/2$ across each subinterval, and the steepness $w$ large enough that each bump approximates its indicator to within $\varepsilon/2$. The two error budgets sum to less than $\varepsilon$. In higher dimensions one forms products of such bumps along coordinate directions, or projects onto many directions $w_i$, to localize in $\mathbb{R}^d$. This argument is concrete and explains why sigmoids suffice, though the bump count and the resulting width grow quickly with dimension, an early hint of the cost analysis in Section 5.

The mechanism by which two saturating units combine into one localized feature is the heart of the construction:

flowchart LR
    A["Steep sigmoid at a, rising step"] --> C["Subtract"]
    B["Steep sigmoid at a-prime, rising step"] --> C
    C --> D["Localized bump near one on a to a-prime, near zero outside"]

184.3.2 3.2 The Functional Analytic Argument

Cybenko’s own proof is nonconstructive and elegant. Suppose, toward a contradiction, that the closure $S$ of the network family is a proper closed subspace of $C(K)$. By the Hahn-Banach theorem there is a nonzero bounded linear functional that vanishes on $S$. By the Riesz representation theorem this functional corresponds to a signed measure $\mu$ with

\[ \int_K \sigma\!\left(w^\top x + b\right) \, d\mu(x) = 0 \quad \text{for all } w, b. \]

Cybenko shows that a sigmoidal $\sigma$ is discriminatory, meaning that the only finite signed measure $\mu$ for which $\int_K \sigma(w^\top x + b) \, d\mu(x) = 0$ holds for every $w$ and $b$ is the zero measure $\mu = 0$. Granting discriminatoriness, the functional must be zero, contradicting the assumption that it was nonzero. Hence $S = C(K)$ and the family is dense. The proof of discriminatoriness itself exploits the saturation of the sigmoid: as $w$ scales up, $\sigma(w^\top x + b)$ approaches an indicator of a half space, so vanishing integrals against all such limits force $\mu$ to assign zero mass to every half space, which by a uniqueness argument forces $\mu = 0$. This route trades constructive insight for generality and a short proof, and it generalizes cleanly to the non-polynomial characterization of Leshno et al. (1993) and the survey of Pinkus (1999).

184.4 4. Deep Network Statements

The classical theorems fix depth at one hidden layer and let width grow. Modern practice does the opposite, so a complementary family of results fixes width and lets depth grow.

184.4.1 4.1 Bounded Width, Unbounded Depth

A representative result concerns networks with the ReLU activation $\sigma(t) = \max(0, t)$. Lu et al. (2017) and Hanin and Sellke (2017) showed that ReLU networks of width $d + 1$, where $d$ is the input dimension, are universal approximators for continuous functions on compact sets when depth is allowed to grow. Sharper analyses identified critical width thresholds below which universality fails. The intuition is that each layer can preserve and reshape information in a bounded number of channels while depth supplies the expressive power, somewhat like a narrow but very long assembly line that performs many sequential transformations.

A compact summary of the two regimes. The shallow regime fixes depth at one and lets width $N$ grow without bound. The deep regime fixes width near $d+1$ and lets depth $L$ grow without bound. Both regimes achieve uniform density in $C(K)$, so universality is not a property of a single architectural axis but of either one taken to its limit.

184.4.2 4.2 Why Depth Is Natural for ReLU

A ReLU network computes a continuous piecewise linear function, and the number of linear regions it can carve is the natural measure of its expressivity. Composing layers multiplies the region count, so regions can grow exponentially in depth while growing only polynomially in width. This compositional multiplication is the structural reason depth can be more parameter efficient than width, which we make precise next.

184.5 5. Width Versus Depth Efficiency

Universality is qualitative. The quantitative question is how many parameters a given accuracy demands, and here shallow and deep networks diverge sharply.

184.5.1 5.1 The Cost of Shallow Approximation

For shallow networks the required width often scales badly with dimension and smoothness. For a general Lipschitz target on $[0,1]^d$ the number of hidden units needed to reach error $\varepsilon$ can scale like $\varepsilon^{-d}$, an instance of the curse of dimensionality. Barron (1993) provided an important refinement. For functions whose Fourier transform satisfies an integrability condition, captured by the Barron norm $C_g$, a shallow network achieves

\[ \| f_N - g \|_{L^2}^2 \le \frac{C_g^2}{N}, \]

a rate of $O(N^{-1/2})$ that is independent of dimension. The lesson is nuanced. Shallow networks are efficient for the special class of low Barron norm functions but can be hopeless for generic high dimensional targets.

184.5.2 5.2 Depth Separations

A depth separation is a theorem exhibiting a function that a deep network represents with few parameters but that any shallow network requires exponentially many parameters to approximate. Telgarsky (2016) constructed such functions using the rapidly oscillating triangle, or sawtooth, map. A ReLU network of depth $L$ can produce a sawtooth with $2^L$ oscillations using $O(L)$ units, because composing the simple triangle map doubles the oscillation count each time:

\[ T(x) = 2 \min(x, 1 - x), \qquad T^{(L)} \text{ has } 2^{L-1} \text{ teeth.} \]

Matching that oscillation count with a shallow network requires on the order of $2^L$ units, since a single layer must place each tooth independently. This is a provable exponential gap in favor of depth for this function class.

184.5.3 5.3 A Worked Example: The Sawtooth by Composition

To make the depth advantage concrete, trace the triangle map $T(x) = 2 \min(x, 1-x)$ on the unit interval. A single ReLU layer realizes it exactly, since $\min(x, 1-x) = x - \mathrm{ReLU}(2x - 1)$ after rescaling, so $T$ costs a constant number of units. Now compose:

$T^{(1)} = T$ has one tooth, a single rise and fall, peaking at $x = 1/2$.
$T^{(2)} = T \circ T$ folds the interval again. Each rising and each falling segment of the first map gets a full copy of $T$ stacked on it, doubling the count to two teeth.
In general $T^{(L)}$ has $2^{L-1}$ teeth, because each composition reflects and repeats the existing pattern, doubling the number of monotone pieces.

A depth $L$ ReLU network implements $T^{(L)}$ with $O(L)$ total units, since each layer adds one composition. To represent the same $2^{L-1}$ oscillations, a single hidden layer network must change slope at every one of the roughly $2^L$ breakpoints, and a ReLU unit contributes at most one new breakpoint, so width on the order of $2^L$ is forced. Setting $L = 30$ already separates roughly thirty units from over a billion. The compositional structure of the sawtooth is exactly the structure depth exploits, and it is no accident that hierarchical real world signals share that flavor.

184.5.4 5.4 The Practical Reading

These separations explain a robust empirical pattern. Depth buys exponential expressivity for compositional and hierarchical targets, which is precisely the structure of images, language, and most natural signals, where features build on features. Width remains valuable for capacity and for representing functions that are wide but shallow in structure. Real architectures balance both, and the theorems tell us neither dimension is dispensable.

184.6 6. What the Theorem Does and Does Not Promise

It is easy to over read the UAT, so we state its limits explicitly.

184.6.1 6.1 What It Promises

Existence of a good approximant. Some network in the family comes uniformly close to the target on any compact set.
A sharp activation condition. Non-polynomial continuity is exactly what is needed in the shallow case.
Two complementary regimes. Universality holds both by growing width at fixed depth and by growing depth at fixed width.

184.6.2 6.2 What It Does Not Promise

No width or depth bound for free. The theorem is silent on size in its base form. Quantitative rates require extra assumptions such as a finite Barron norm or known smoothness.
No optimization guarantee. Existence of good weights says nothing about whether gradient descent will find them. The loss landscape is nonconvex and a representable function may be unreachable from a given initialization.
No generalization guarantee. Fitting a target on a compact set, or interpolating training points, does not imply good behavior on new inputs. Generalization is governed by capacity control, regularization, and data, not by representability.
Nothing outside the compact set. Approximation is uniform only on $K$. Extrapolation beyond the region seen in training is wholly unconstrained, which is why neural networks routinely behave unpredictably out of distribution.
Continuity is assumed. The standard statements target continuous functions. Discontinuous or pathological targets need separate treatment and weaker modes of convergence.

184.6.3 6.3 A Mental Model

The right way to hold the theorem is as a statement about the reachable set, not the learning algorithm. It guarantees the destination exists on the map. It says nothing about whether the road is paved, whether your vehicle can get there, or whether the destination generalizes to where you actually need to go. This is liberating and humbling at once. We never have to wonder whether the architecture is expressive enough in principle, but we must accept that every interesting difficulty in deep learning lives in the gap between representability and the two harder problems the theorem deliberately ignores.

184.6.4 6.4 How to Use the Theorem in Practice

The UAT is a reasoning tool, not a design recipe. A few concrete guidelines follow from it.

Stop blaming the architecture for fit failures. If a sufficiently sized network underfits the training set, the theorem says the cause is almost never representational. Look instead at the optimizer, the learning rate, initialization, normalization, or a bug in the loss, not at the activation choice or layer count.
Match structure to the target. The depth separations argue for depth when the target is compositional or hierarchical, which describes most perceptual and language tasks. Width remains the lever for raw capacity and for targets that are wide but shallow in structure.
Treat the non-polynomial condition as essentially free. Every mainstream activation already satisfies it. The condition matters mainly as a cautionary note: a purely polynomial activation would silently cap expressivity.
Do not read universality as a guarantee on test data. The theorem lives entirely on a compact training region. Out of distribution inputs and extrapolation are unconstrained, so robustness and generalization need their own machinery (regularization, data coverage, and capacity control), addressed in their own chapters.

These ideas are easy to inspect numerically with mature open source tools. A short loop in numpy that subtracts two steep tanh units reproduces the bump of Section 3.1, and a small scikit-learn MLPRegressor or a pytorch module fit to the sawtooth $T^{(L)}$ makes the depth advantage of Section 5.3 visible as a shallow network struggling to place every tooth.

184.7 7. Summary

The Universal Approximation Theorem establishes that feedforward networks with non-polynomial activations are dense in the space of continuous functions on compact sets, whether we grow width at fixed depth or depth at fixed width. The proofs proceed either by constructing localized bumps from pairs of saturating units or by a Hahn-Banach and Riesz argument that turns density into a statement about discriminatory activations. Quantitatively, shallow networks face the curse of dimensionality except on restricted classes such as Barron functions, while depth provides provable exponential gains for compositional targets through region multiplication and sawtooth separations. The theorem promises representability and nothing more, leaving optimization, generalization, and extrapolation as the substantive challenges that define modern practice.

184.8 References

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303-314. https://link.springer.com/article/10.1007/BF02551274
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359-366. https://www.sciencedirect.com/science/article/abs/pii/0893608089900208
Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2), 251-257. https://www.sciencedirect.com/science/article/abs/pii/089360809190009T
Leshno, M., Lin, V. Y., Pinkus, A., & Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6), 861-867. https://www.sciencedirect.com/science/article/abs/pii/S0893608005801315
Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3), 930-945. https://ieeexplore.ieee.org/document/256500
Telgarsky, M. (2016). Benefits of depth in neural networks. Conference on Learning Theory (COLT). https://proceedings.mlr.press/v49/telgarsky16.html
Lu, Z., Pu, H., Wang, F., Hu, Z., & Wang, L. (2017). The expressive power of neural networks: A view from the width. Advances in Neural Information Processing Systems (NeurIPS). https://papers.nips.cc/paper/2017/hash/32cbf687880eb1674a07bf717761dd3a-Abstract.html
Hanin, B., & Sellke, M. (2017). Approximating continuous functions by ReLU nets of minimal width. https://arxiv.org/abs/1710.11278
Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta Numerica, 8, 143-195. https://www.cambridge.org/core/journals/acta-numerica/article/abs/approximation-theory-of-the-mlp-model-in-neural-networks/18072C558C8410C4F92A82BCC8FC8CF9

# The Universal Approximation Theorem The Universal Approximation Theorem (UAT) is the foundational mathematical justification for using neural networks as general purpose function approximators. It tells us that under mild conditions a feedforward network can represent essentially any function we might care about. This chapter states the theorem carefully for both shallow and deep networks, sketches the intuition behind the classical proofs, contrasts the efficiency of width against depth, and clarifies the boundary between what the theorem guarantees and what it leaves silent. ## 1. Why the Theorem Matters When we train a neural network we are searching inside a parametric family of functions for one that fits our data and generalizes. A natural first question is whether that family is rich enough to contain a good answer at all. If the hypothesis class were fundamentally limited, no optimizer and no amount of data could rescue it. The Universal Approximation Theorem answers this question of representational capacity in the affirmative. It separates three concerns that are easy to conflate: 1. **Representability.** Does a function close to the target exist in the network family? 2. **Learnability.** Can an optimization procedure find it from finite data? 3. **Generalization.** Does the found function behave well on unseen inputs? The UAT speaks only to the first concern. Its value is that it removes representability as an excuse for failure, which sharpens our attention on optimization and generalization, the genuinely hard parts of deep learning. ## 2. The Shallow Network Statement ### 2.1 Setup and Notation Consider a single hidden layer network mapping $\mathbb{R}^d$ to $\mathbb{R}$. With $N$ hidden units it computes $$ f_N(x) = \sum_{i=1}^{N} c_i \, \sigma\!\left(w_i^\top x + b_i\right), $$ where $w_i \in \mathbb{R}^d$ are weight vectors, $b_i \in \mathbb{R}$ are biases, $c_i \in \mathbb{R}$ are output weights, and $\sigma$ is a fixed scalar nonlinearity applied componentwise. The set of all such $f_N$ over all $N$ and all parameter choices is the family of single hidden layer networks with activation $\sigma$. ### 2.2 Two Notions of Distance Before stating the theorem we fix what "close" means, because the choice of norm changes both the statement and the activation conditions required. Let $K \subset \mathbb{R}^d$ be compact and let $C(K)$ be the space of continuous real functions on $K$. * **Uniform (supremum) norm.** $\|h\|_\infty = \sup_{x \in K} |h(x)|$. Closeness in this norm means the approximant tracks the target everywhere on $K$ at once, with no exceptions allowed. This is the strongest and most common mode for UAT statements. * **$L^p$ norm.** $\|h\|_{L^p(\mu)} = \left(\int_K |h(x)|^p \, d\mu(x)\right)^{1/p}$ for a finite measure $\mu$. Closeness here allows large pointwise errors on small sets, which is the natural mode for quantitative rates such as Barron's bound in Section 5. We say the network family is **dense** in a target space $V$ under a norm $\|\cdot\|$ if for every $g \in V$ and every $\varepsilon > 0$ there is a network $f$ with $\|f - g\| < \varepsilon$. Universality is exactly this density claim. ### 2.3 Cybenko and Hornik The classical results come in two flavors. Cybenko (1989) proved the theorem for **sigmoidal** activations, meaning functions with $\sigma(t) \to 1$ as $t \to +\infty$ and $\sigma(t) \to 0$ as $t \to -\infty$. Hornik, Stinchcombe, and White (1989) gave a parallel argument and Hornik (1991) later isolated the essential property. A clean modern statement is the following. ::: {.callout-note appearance="simple"} **Theorem (Universal Approximation, shallow form).** Let $\sigma : \mathbb{R} \to \mathbb{R}$ be continuous and **not a polynomial**. Let $K \subset \mathbb{R}^d$ be compact. Then for every continuous target $g \in C(K)$ and every tolerance $\varepsilon > 0$, there exist a width $N \in \mathbb{N}$, weight vectors $w_i \in \mathbb{R}^d$, biases $b_i \in \mathbb{R}$, and output weights $c_i \in \mathbb{R}$ such that the single hidden layer network $f_N(x) = \sum_{i=1}^N c_i \, \sigma(w_i^\top x + b_i)$ satisfies $$ \sup_{x \in K} \left| f_N(x) - g(x) \right| < \varepsilon. $$ ::: In topological language, the set of single hidden layer networks is **dense** in $C(K)$ under the uniform norm. The non-polynomial condition (Leshno et al., 1993) is the sharp characterization. If $\sigma$ were a polynomial of degree $k$, every $f_N$ would also be a polynomial of degree at most $k$, since a sum of degree $k$ polynomials of affine arguments stays within that degree, and the family could never escape the finite dimensional space of degree $k$ polynomials. Such a family cannot approximate, say, $\cos(x)$ uniformly on a sufficiently long interval. Excluding polynomials is therefore both necessary and sufficient. Most activations used in practice satisfy the condition automatically: the logistic sigmoid, $\tanh$, ReLU, GELU, and the swish are all continuous and non-polynomial. ### 2.3 What Density Means in Practice Density is an existence claim. It promises that the approximation error can be driven below any positive threshold, but it places no a priori bound on the width $N$ required. The required width can grow without limit as $\varepsilon \to 0$ or as the target function becomes more oscillatory. We return to this cost in Section 5. ## 3. The Intuition of the Proof There are several routes to the shallow theorem. Two illuminate the mechanism especially well. ### 3.1 The Constructive Bump Argument Work first in one dimension with a sigmoidal $\sigma$. A single unit $\sigma(w x + b)$ with large $w$ approximates a step function with a jump near $x = -b/w$. Subtracting two shifted steps, $$ \sigma\!\left(w(x - a)\right) - \sigma\!\left(w(x - a')\right), $$ yields an approximate **bump** that is near one on the interval $[a, a']$ and near zero elsewhere. With enough bumps we build a piecewise constant approximation to any continuous $g$, and piecewise constant functions are dense in $C[a, a']$ by uniform continuity. Concretely, partition $[a, a']$ into $M$ subintervals of width $\delta = (a'-a)/M$. On each subinterval pick a representative value of $g$ and place a scaled bump there. Uniform continuity of $g$ guarantees that for any $\varepsilon$ we can choose $\delta$ small enough that $g$ varies by less than $\varepsilon/2$ across each subinterval, and the steepness $w$ large enough that each bump approximates its indicator to within $\varepsilon/2$. The two error budgets sum to less than $\varepsilon$. In higher dimensions one forms products of such bumps along coordinate directions, or projects onto many directions $w_i$, to localize in $\mathbb{R}^d$. This argument is concrete and explains why sigmoids suffice, though the bump count and the resulting width grow quickly with dimension, an early hint of the cost analysis in Section 5. The mechanism by which two saturating units combine into one localized feature is the heart of the construction: ```{mermaid} flowchart LR A["Steep sigmoid at a, rising step"] --> C["Subtract"] B["Steep sigmoid at a-prime, rising step"] --> C C --> D["Localized bump near one on a to a-prime, near zero outside"] ``` ### 3.2 The Functional Analytic Argument Cybenko's own proof is nonconstructive and elegant. Suppose, toward a contradiction, that the closure $S$ of the network family is a proper closed subspace of $C(K)$. By the Hahn-Banach theorem there is a nonzero bounded linear functional that vanishes on $S$. By the Riesz representation theorem this functional corresponds to a signed measure $\mu$ with $$ \int_K \sigma\!\left(w^\top x + b\right) \, d\mu(x) = 0 \quad \text{for all } w, b. $$ Cybenko shows that a sigmoidal $\sigma$ is **discriminatory**, meaning that the only finite signed measure $\mu$ for which $\int_K \sigma(w^\top x + b) \, d\mu(x) = 0$ holds for every $w$ and $b$ is the zero measure $\mu = 0$. Granting discriminatoriness, the functional must be zero, contradicting the assumption that it was nonzero. Hence $S = C(K)$ and the family is dense. The proof of discriminatoriness itself exploits the saturation of the sigmoid: as $w$ scales up, $\sigma(w^\top x + b)$ approaches an indicator of a half space, so vanishing integrals against all such limits force $\mu$ to assign zero mass to every half space, which by a uniqueness argument forces $\mu = 0$. This route trades constructive insight for generality and a short proof, and it generalizes cleanly to the non-polynomial characterization of Leshno et al. (1993) and the survey of Pinkus (1999). ## 4. Deep Network Statements The classical theorems fix depth at one hidden layer and let width grow. Modern practice does the opposite, so a complementary family of results fixes width and lets depth grow. ### 4.1 Bounded Width, Unbounded Depth A representative result concerns networks with the ReLU activation $\sigma(t) = \max(0, t)$. Lu et al. (2017) and Hanin and Sellke (2017) showed that ReLU networks of **width** $d + 1$, where $d$ is the input dimension, are universal approximators for continuous functions on compact sets when depth is allowed to grow. Sharper analyses identified critical width thresholds below which universality fails. The intuition is that each layer can preserve and reshape information in a bounded number of channels while depth supplies the expressive power, somewhat like a narrow but very long assembly line that performs many sequential transformations. A compact summary of the two regimes. The shallow regime fixes depth at one and lets width $N$ grow without bound. The deep regime fixes width near $d+1$ and lets depth $L$ grow without bound. Both regimes achieve uniform density in $C(K)$, so universality is not a property of a single architectural axis but of either one taken to its limit. ### 4.2 Why Depth Is Natural for ReLU A ReLU network computes a continuous **piecewise linear** function, and the number of linear regions it can carve is the natural measure of its expressivity. Composing layers multiplies the region count, so regions can grow exponentially in depth while growing only polynomially in width. This compositional multiplication is the structural reason depth can be more parameter efficient than width, which we make precise next. ## 5. Width Versus Depth Efficiency Universality is qualitative. The quantitative question is how many parameters a given accuracy demands, and here shallow and deep networks diverge sharply. ### 5.1 The Cost of Shallow Approximation For shallow networks the required width often scales badly with dimension and smoothness. For a general Lipschitz target on $[0,1]^d$ the number of hidden units needed to reach error $\varepsilon$ can scale like $\varepsilon^{-d}$, an instance of the curse of dimensionality. Barron (1993) provided an important refinement. For functions whose Fourier transform satisfies an integrability condition, captured by the Barron norm $C_g$, a shallow network achieves $$ \| f_N - g \|_{L^2}^2 \le \frac{C_g^2}{N}, $$ a rate of $O(N^{-1/2})$ that is **independent of dimension**. The lesson is nuanced. Shallow networks are efficient for the special class of low Barron norm functions but can be hopeless for generic high dimensional targets. ### 5.2 Depth Separations A depth separation is a theorem exhibiting a function that a deep network represents with few parameters but that any shallow network requires exponentially many parameters to approximate. Telgarsky (2016) constructed such functions using the rapidly oscillating triangle, or sawtooth, map. A ReLU network of depth $L$ can produce a sawtooth with $2^L$ oscillations using $O(L)$ units, because composing the simple triangle map doubles the oscillation count each time: $$ T(x) = 2 \min(x, 1 - x), \qquad T^{(L)} \text{ has } 2^{L-1} \text{ teeth.} $$ Matching that oscillation count with a shallow network requires on the order of $2^L$ units, since a single layer must place each tooth independently. This is a provable exponential gap in favor of depth for this function class. ### 5.3 A Worked Example: The Sawtooth by Composition To make the depth advantage concrete, trace the triangle map $T(x) = 2 \min(x, 1-x)$ on the unit interval. A single ReLU layer realizes it exactly, since $\min(x, 1-x) = x - \mathrm{ReLU}(2x - 1)$ after rescaling, so $T$ costs a constant number of units. Now compose: * $T^{(1)} = T$ has one tooth, a single rise and fall, peaking at $x = 1/2$. * $T^{(2)} = T \circ T$ folds the interval again. Each rising and each falling segment of the first map gets a full copy of $T$ stacked on it, doubling the count to two teeth. * In general $T^{(L)}$ has $2^{L-1}$ teeth, because each composition reflects and repeats the existing pattern, doubling the number of monotone pieces. A depth $L$ ReLU network implements $T^{(L)}$ with $O(L)$ total units, since each layer adds one composition. To represent the same $2^{L-1}$ oscillations, a single hidden layer network must change slope at every one of the roughly $2^L$ breakpoints, and a ReLU unit contributes at most one new breakpoint, so width on the order of $2^L$ is forced. Setting $L = 30$ already separates roughly thirty units from over a billion. The compositional structure of the sawtooth is exactly the structure depth exploits, and it is no accident that hierarchical real world signals share that flavor. ### 5.4 The Practical Reading These separations explain a robust empirical pattern. Depth buys exponential expressivity for compositional and hierarchical targets, which is precisely the structure of images, language, and most natural signals, where features build on features. Width remains valuable for capacity and for representing functions that are wide but shallow in structure. Real architectures balance both, and the theorems tell us neither dimension is dispensable. ## 6. What the Theorem Does and Does Not Promise It is easy to over read the UAT, so we state its limits explicitly. ### 6.1 What It Promises * **Existence of a good approximant.** Some network in the family comes uniformly close to the target on any compact set. * **A sharp activation condition.** Non-polynomial continuity is exactly what is needed in the shallow case. * **Two complementary regimes.** Universality holds both by growing width at fixed depth and by growing depth at fixed width. ### 6.2 What It Does Not Promise * **No width or depth bound for free.** The theorem is silent on size in its base form. Quantitative rates require extra assumptions such as a finite Barron norm or known smoothness. * **No optimization guarantee.** Existence of good weights says nothing about whether gradient descent will find them. The loss landscape is nonconvex and a representable function may be unreachable from a given initialization. * **No generalization guarantee.** Fitting a target on a compact set, or interpolating training points, does not imply good behavior on new inputs. Generalization is governed by capacity control, regularization, and data, not by representability. * **Nothing outside the compact set.** Approximation is uniform only on $K$. Extrapolation beyond the region seen in training is wholly unconstrained, which is why neural networks routinely behave unpredictably out of distribution. * **Continuity is assumed.** The standard statements target continuous functions. Discontinuous or pathological targets need separate treatment and weaker modes of convergence. ### 6.3 A Mental Model The right way to hold the theorem is as a statement about the **reachable set**, not the **learning algorithm**. It guarantees the destination exists on the map. It says nothing about whether the road is paved, whether your vehicle can get there, or whether the destination generalizes to where you actually need to go. This is liberating and humbling at once. We never have to wonder whether the architecture is expressive enough in principle, but we must accept that every interesting difficulty in deep learning lives in the gap between representability and the two harder problems the theorem deliberately ignores. ### 6.4 How to Use the Theorem in Practice The UAT is a reasoning tool, not a design recipe. A few concrete guidelines follow from it. * **Stop blaming the architecture for fit failures.** If a sufficiently sized network underfits the training set, the theorem says the cause is almost never representational. Look instead at the optimizer, the learning rate, initialization, normalization, or a bug in the loss, not at the activation choice or layer count. * **Match structure to the target.** The depth separations argue for depth when the target is compositional or hierarchical, which describes most perceptual and language tasks. Width remains the lever for raw capacity and for targets that are wide but shallow in structure. * **Treat the non-polynomial condition as essentially free.** Every mainstream activation already satisfies it. The condition matters mainly as a cautionary note: a purely polynomial activation would silently cap expressivity. * **Do not read universality as a guarantee on test data.** The theorem lives entirely on a compact training region. Out of distribution inputs and extrapolation are unconstrained, so robustness and generalization need their own machinery (regularization, data coverage, and capacity control), addressed in their own chapters. These ideas are easy to inspect numerically with mature open source tools. A short loop in `numpy` that subtracts two steep `tanh` units reproduces the bump of Section 3.1, and a small `scikit-learn` MLPRegressor or a `pytorch` module fit to the sawtooth $T^{(L)}$ makes the depth advantage of Section 5.3 visible as a shallow network struggling to place every tooth. ## 7. Summary The Universal Approximation Theorem establishes that feedforward networks with non-polynomial activations are dense in the space of continuous functions on compact sets, whether we grow width at fixed depth or depth at fixed width. The proofs proceed either by constructing localized bumps from pairs of saturating units or by a Hahn-Banach and Riesz argument that turns density into a statement about discriminatory activations. Quantitatively, shallow networks face the curse of dimensionality except on restricted classes such as Barron functions, while depth provides provable exponential gains for compositional targets through region multiplication and sawtooth separations. The theorem promises representability and nothing more, leaving optimization, generalization, and extrapolation as the substantive challenges that define modern practice. ## References 1. Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. *Mathematics of Control, Signals and Systems*, 2(4), 303-314. https://link.springer.com/article/10.1007/BF02551274 2. Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. *Neural Networks*, 2(5), 359-366. https://www.sciencedirect.com/science/article/abs/pii/0893608089900208 3. Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. *Neural Networks*, 4(2), 251-257. https://www.sciencedirect.com/science/article/abs/pii/089360809190009T 4. Leshno, M., Lin, V. Y., Pinkus, A., & Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. *Neural Networks*, 6(6), 861-867. https://www.sciencedirect.com/science/article/abs/pii/S0893608005801315 5. Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. *IEEE Transactions on Information Theory*, 39(3), 930-945. https://ieeexplore.ieee.org/document/256500 6. Telgarsky, M. (2016). Benefits of depth in neural networks. *Conference on Learning Theory (COLT)*. https://proceedings.mlr.press/v49/telgarsky16.html 7. Lu, Z., Pu, H., Wang, F., Hu, Z., & Wang, L. (2017). The expressive power of neural networks: A view from the width. *Advances in Neural Information Processing Systems (NeurIPS)*. https://papers.nips.cc/paper/2017/hash/32cbf687880eb1674a07bf717761dd3a-Abstract.html 8. Hanin, B., & Sellke, M. (2017). Approximating continuous functions by ReLU nets of minimal width. https://arxiv.org/abs/1710.11278 9. Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. *Acta Numerica*, 8, 143-195. https://www.cambridge.org/core/journals/acta-numerica/article/abs/approximation-theory-of-the-mlp-model-in-neural-networks/18072C558C8410C4F92A82BCC8FC8CF9