184  The Universal Approximation Theorem

The Universal Approximation Theorem (UAT) is the foundational mathematical justification for using neural networks as general purpose function approximators. It tells us that under mild conditions a feedforward network can represent essentially any function we might care about. This chapter states the theorem carefully for both shallow and deep networks, sketches the intuition behind the classical proofs, contrasts the efficiency of width against depth, and clarifies the boundary between what the theorem guarantees and what it leaves silent.

184.1 1. Why the Theorem Matters

When we train a neural network we are searching inside a parametric family of functions for one that fits our data and generalizes. A natural first question is whether that family is rich enough to contain a good answer at all. If the hypothesis class were fundamentally limited, no optimizer and no amount of data could rescue it. The Universal Approximation Theorem answers this question of representational capacity in the affirmative. It separates three concerns that are easy to conflate:

  1. Representability. Does a function close to the target exist in the network family?
  2. Learnability. Can an optimization procedure find it from finite data?
  3. Generalization. Does the found function behave well on unseen inputs?

The UAT speaks only to the first concern. Its value is that it removes representability as an excuse for failure, which sharpens our attention on optimization and generalization, the genuinely hard parts of deep learning.

184.2 2. The Shallow Network Statement

184.2.1 2.1 Setup and Notation

Consider a single hidden layer network mapping \(\mathbb{R}^d\) to \(\mathbb{R}\). With \(N\) hidden units it computes

\[ f_N(x) = \sum_{i=1}^{N} c_i \, \sigma\!\left(w_i^\top x + b_i\right), \]

where \(w_i \in \mathbb{R}^d\) are weight vectors, \(b_i \in \mathbb{R}\) are biases, \(c_i \in \mathbb{R}\) are output weights, and \(\sigma\) is a fixed scalar nonlinearity applied componentwise. The set of all such \(f_N\) over all \(N\) and all parameter choices is the family of single hidden layer networks with activation \(\sigma\).

184.2.2 2.2 Cybenko and Hornik

The classical results come in two flavors. Cybenko (1989) proved the theorem for sigmoidal activations, meaning functions with \(\sigma(t) \to 1\) as \(t \to +\infty\) and \(\sigma(t) \to 0\) as \(t \to -\infty\). Hornik, Stinchcombe, and White (1989) gave a parallel argument and Hornik (1991) later isolated the essential property.

A clean modern statement is the following. Let \(\sigma\) be continuous and not a polynomial. Then for any continuous target \(g\) on a compact set \(K \subset \mathbb{R}^d\) and any tolerance \(\varepsilon > 0\), there exists a width \(N\) and parameters such that

\[ \sup_{x \in K} \left| f_N(x) - g(x) \right| < \varepsilon. \]

In topological language, the set of single hidden layer networks is dense in \(C(K)\) under the uniform norm. The non-polynomial condition (Leshno et al., 1993) is the sharp characterization. If \(\sigma\) were a polynomial of degree \(k\), every \(f_N\) would also be a polynomial of degree at most \(k\), and such functions cannot approximate, say, \(\cos(x)\) uniformly on the real line. Excluding polynomials is therefore both necessary and sufficient.

184.2.3 2.3 What Density Means in Practice

Density is an existence claim. It promises that the approximation error can be driven below any positive threshold, but it places no a priori bound on the width \(N\) required. The required width can grow without limit as \(\varepsilon \to 0\) or as the target function becomes more oscillatory. We return to this cost in Section 5.

184.3 3. The Intuition of the Proof

There are several routes to the shallow theorem. Two illuminate the mechanism especially well.

184.3.1 3.1 The Constructive Bump Argument

Work first in one dimension with a sigmoidal \(\sigma\). A single unit \(\sigma(w x + b)\) with large \(w\) approximates a step function with a jump near \(x = -b/w\). Subtracting two shifted steps,

\[ \sigma\!\left(w(x - a)\right) - \sigma\!\left(w(x - a')\right), \]

yields an approximate bump that is near one on the interval \([a, a']\) and near zero elsewhere. With enough bumps we build a piecewise constant approximation to any continuous \(g\), and piecewise constant functions are dense in \(C[a, a']\) by uniform continuity. In higher dimensions one forms products of such bumps along coordinate directions, or projects onto many directions \(w_i\), to localize in \(\mathbb{R}^d\). This argument is concrete and explains why sigmoids suffice, though the bump count and the resulting width grow quickly with dimension.

two opposing sigmoids form one localized bump
   _____                      _______
  /          minus    \      =        / \
 /                      \____         /   \____

184.3.2 3.2 The Functional Analytic Argument

Cybenko’s own proof is nonconstructive and elegant. Suppose, toward a contradiction, that the closure \(S\) of the network family is a proper closed subspace of \(C(K)\). By the Hahn-Banach theorem there is a nonzero bounded linear functional that vanishes on \(S\). By the Riesz representation theorem this functional corresponds to a signed measure \(\mu\) with

\[ \int_K \sigma\!\left(w^\top x + b\right) \, d\mu(x) = 0 \quad \text{for all } w, b. \]

Cybenko shows that a sigmoidal \(\sigma\) is discriminatory, meaning the only measure satisfying this is \(\mu = 0\), contradicting the assumption that the functional was nonzero. Hence \(S = C(K)\) and the family is dense. This route trades constructive insight for generality and a short proof.

184.4 4. Deep Network Statements

The classical theorems fix depth at one hidden layer and let width grow. Modern practice does the opposite, so a complementary family of results fixes width and lets depth grow.

184.4.1 4.1 Bounded Width, Unbounded Depth

A representative result concerns networks with the ReLU activation \(\sigma(t) = \max(0, t)\). Lu et al. (2017) and Hanin and Sellke (2017) showed that ReLU networks of width \(d + 1\), where \(d\) is the input dimension, are universal approximators for continuous functions on compact sets when depth is allowed to grow. Sharper analyses identified critical width thresholds below which universality fails. The intuition is that each layer can preserve and reshape information in a bounded number of channels while depth supplies the expressive power, somewhat like a narrow but very long assembly line that performs many sequential transformations.

A compact summary of the two regimes:

shallow:  fix depth = 1,  let width  N -> infinity
deep:     fix width = d+1, let depth  L -> infinity
both regimes achieve uniform density in C(K)

184.4.2 4.2 Why Depth Is Natural for ReLU

A ReLU network computes a continuous piecewise linear function, and the number of linear regions it can carve is the natural measure of its expressivity. Composing layers multiplies the region count, so regions can grow exponentially in depth while growing only polynomially in width. This compositional multiplication is the structural reason depth can be more parameter efficient than width, which we make precise next.

184.5 5. Width Versus Depth Efficiency

Universality is qualitative. The quantitative question is how many parameters a given accuracy demands, and here shallow and deep networks diverge sharply.

184.5.1 5.1 The Cost of Shallow Approximation

For shallow networks the required width often scales badly with dimension and smoothness. For a general Lipschitz target on \([0,1]^d\) the number of hidden units needed to reach error \(\varepsilon\) can scale like \(\varepsilon^{-d}\), an instance of the curse of dimensionality. Barron (1993) provided an important refinement. For functions whose Fourier transform satisfies an integrability condition, captured by the Barron norm \(C_g\), a shallow network achieves

\[ \| f_N - g \|_{L^2}^2 \le \frac{C_g^2}{N}, \]

a rate of \(O(N^{-1/2})\) that is independent of dimension. The lesson is nuanced. Shallow networks are efficient for the special class of low Barron norm functions but can be hopeless for generic high dimensional targets.

184.5.2 5.2 Depth Separations

A depth separation is a theorem exhibiting a function that a deep network represents with few parameters but that any shallow network requires exponentially many parameters to approximate. Telgarsky (2016) constructed such functions using the rapidly oscillating triangle, or sawtooth, map. A ReLU network of depth \(L\) can produce a sawtooth with \(2^L\) oscillations using \(O(L)\) units, because composing the simple triangle map doubles the oscillation count each time:

\[ T(x) = 2 \min(x, 1 - x), \qquad T^{(L)} \text{ has } 2^{L-1} \text{ teeth.} \]

Matching that oscillation count with a shallow network requires on the order of \(2^L\) units, since a single layer must place each tooth independently. This is a provable exponential gap in favor of depth for this function class.

184.5.3 5.3 The Practical Reading

These separations explain a robust empirical pattern. Depth buys exponential expressivity for compositional and hierarchical targets, which is precisely the structure of images, language, and most natural signals, where features build on features. Width remains valuable for capacity and for representing functions that are wide but shallow in structure. Real architectures balance both, and the theorems tell us neither dimension is dispensable.

184.6 6. What the Theorem Does and Does Not Promise

It is easy to over read the UAT, so we state its limits explicitly.

184.6.1 6.1 What It Promises

  • Existence of a good approximant. Some network in the family comes uniformly close to the target on any compact set.
  • A sharp activation condition. Non-polynomial continuity is exactly what is needed in the shallow case.
  • Two complementary regimes. Universality holds both by growing width at fixed depth and by growing depth at fixed width.

184.6.2 6.2 What It Does Not Promise

  • No width or depth bound for free. The theorem is silent on size in its base form. Quantitative rates require extra assumptions such as a finite Barron norm or known smoothness.
  • No optimization guarantee. Existence of good weights says nothing about whether gradient descent will find them. The loss landscape is nonconvex and a representable function may be unreachable from a given initialization.
  • No generalization guarantee. Fitting a target on a compact set, or interpolating training points, does not imply good behavior on new inputs. Generalization is governed by capacity control, regularization, and data, not by representability.
  • Nothing outside the compact set. Approximation is uniform only on \(K\). Extrapolation beyond the region seen in training is wholly unconstrained, which is why neural networks routinely behave unpredictably out of distribution.
  • Continuity is assumed. The standard statements target continuous functions. Discontinuous or pathological targets need separate treatment and weaker modes of convergence.

184.6.3 6.3 A Mental Model

The right way to hold the theorem is as a statement about the reachable set, not the learning algorithm. It guarantees the destination exists on the map. It says nothing about whether the road is paved, whether your vehicle can get there, or whether the destination generalizes to where you actually need to go. This is liberating and humbling at once. We never have to wonder whether the architecture is expressive enough in principle, but we must accept that every interesting difficulty in deep learning lives in the gap between representability and the two harder problems the theorem deliberately ignores.

184.7 7. Summary

The Universal Approximation Theorem establishes that feedforward networks with non-polynomial activations are dense in the space of continuous functions on compact sets, whether we grow width at fixed depth or depth at fixed width. The proofs proceed either by constructing localized bumps from pairs of saturating units or by a Hahn-Banach and Riesz argument that turns density into a statement about discriminatory activations. Quantitatively, shallow networks face the curse of dimensionality except on restricted classes such as Barron functions, while depth provides provable exponential gains for compositional targets through region multiplication and sawtooth separations. The theorem promises representability and nothing more, leaving optimization, generalization, and extrapolation as the substantive challenges that define modern practice.

184.8 References

  1. Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303-314. https://link.springer.com/article/10.1007/BF02551274
  2. Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359-366. https://www.sciencedirect.com/science/article/abs/pii/0893608089900208
  3. Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2), 251-257. https://www.sciencedirect.com/science/article/abs/pii/089360809190009T
  4. Leshno, M., Lin, V. Y., Pinkus, A., & Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6), 861-867. https://www.sciencedirect.com/science/article/abs/pii/S0893608005801315
  5. Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3), 930-945. https://ieeexplore.ieee.org/document/256500
  6. Telgarsky, M. (2016). Benefits of depth in neural networks. Conference on Learning Theory (COLT). https://proceedings.mlr.press/v49/telgarsky16.html
  7. Lu, Z., Pu, H., Wang, F., Hu, Z., & Wang, L. (2017). The expressive power of neural networks: A view from the width. Advances in Neural Information Processing Systems (NeurIPS). https://papers.nips.cc/paper/2017/hash/32cbf687880eb1674a07bf717761dd3a-Abstract.html
  8. Hanin, B., & Sellke, M. (2017). Approximating continuous functions by ReLU nets of minimal width. https://arxiv.org/abs/1710.11278
  9. Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta Numerica, 8, 143-195. https://www.cambridge.org/core/journals/acta-numerica/article/abs/approximation-theory-of-the-mlp-model-in-neural-networks/18072C558C8410C4F92A82BCC8FC8CF9