143 Independent Component Analysis
Independent Component Analysis (ICA) is a computational framework for recovering a set of underlying source signals from observed mixtures of those signals, given only the mixtures and a few weak statistical assumptions. It is the canonical solution to the blind source separation problem, where the word blind signals that we know almost nothing about either the sources or the mixing process. This chapter develops the theory rigorously, beginning with the generative model, then clarifying the crucial distinction between statistical independence and mere uncorrelatedness, establishing why non-Gaussianity is the engine that makes separation possible, and finally deriving the FastICA algorithm that has become the practical workhorse of the field.
143.1 1. The Blind Source Separation Problem
143.1.1 1.1 The Generative Model
Suppose there are \(n\) statistically independent source signals collected into a random vector \(\mathbf{s} = (s_1, s_2, \ldots, s_n)^\top\). We do not observe these sources directly. Instead we observe \(n\) linear mixtures collected into a vector \(\mathbf{x} = (x_1, \ldots, x_n)^\top\), formed by an unknown invertible mixing matrix \(\mathbf{A} \in \mathbb{R}^{n \times n}\):
\[ \mathbf{x} = \mathbf{A}\mathbf{s}, \qquad x_i = \sum_{j=1}^{n} a_{ij}\, s_j . \]
The goal of ICA is to estimate an unmixing matrix \(\mathbf{W}\) such that
\[ \mathbf{y} = \mathbf{W}\mathbf{x} \]
recovers the sources, ideally with \(\mathbf{W} = \mathbf{A}^{-1}\) up to acceptable ambiguities. The defining difficulty is that we know neither \(\mathbf{A}\) nor \(\mathbf{s}\). All inference must proceed from the observed samples of \(\mathbf{x}\) alone, leaning entirely on the structural assumption that the components of \(\mathbf{s}\) are mutually independent.
143.1.2 1.2 The Cocktail Party Problem
The intuition pump for ICA is the cocktail party problem. Imagine \(n\) people speaking simultaneously in a room containing \(n\) microphones. Each microphone records a different weighted sum of all the voices, because each speaker sits at a different distance from each microphone. The recordings \(x_i(t)\) are the mixtures, the speech signals \(s_j(t)\) are the sources, and the attenuation coefficients form the matrix \(\mathbf{A}\). A human listener can attend to a single conversation amid the babble. ICA gives a machine an analogous ability: from the raw microphone recordings alone, it reconstructs the individual voices without knowing where anyone is standing or what they are saying.
143.1.3 1.3 Inherent Ambiguities
Two ambiguities cannot be resolved by ICA and are usually harmless. First, the variances of the independent components are not identifiable. Since both \(\mathbf{s}\) and \(\mathbf{A}\) are unknown, any scalar multiplying a source \(s_j\) can be absorbed by dividing the corresponding column of \(\mathbf{A}\):
\[ \mathbf{x} = \sum_j \left( \tfrac{1}{\alpha_j}\mathbf{a}_j \right)\left( \alpha_j s_j \right). \]
By convention we fix this scale ambiguity by assuming each source has unit variance, \(\mathbb{E}[s_j^2] = 1\), which still leaves an unresolved sign. Second, the order of the components is arbitrary, because we can permute the sources and apply the inverse permutation to the columns of \(\mathbf{A}\) without changing \(\mathbf{x}\). Formally, ICA recovers \(\mathbf{W} = \mathbf{P}\mathbf{D}\mathbf{A}^{-1}\) for some permutation matrix \(\mathbf{P}\) and diagonal scaling matrix \(\mathbf{D}\).
143.3 3. Non-Gaussianity as the Key to Separation
143.3.1 3.1 The Central Limit Theorem Intuition
Why should non-Gaussianity matter? Consider a single estimated component \(y = \mathbf{w}^\top \mathbf{x} = \mathbf{w}^\top \mathbf{A}\mathbf{s} = \mathbf{q}^\top \mathbf{s}\), where \(\mathbf{q} = \mathbf{A}^\top \mathbf{w}\). The estimate \(y\) is a linear combination of the independent sources. The Central Limit Theorem says that a sum of independent random variables tends toward a Gaussian distribution as the number of terms grows. Consequently, \(\mathbf{q}^\top \mathbf{s}\) is generally more Gaussian than any single source \(s_j\), and it is least Gaussian precisely when it equals one of the sources, meaning \(\mathbf{q}\) has a single nonzero entry. This suggests a striking principle: maximizing the non-Gaussianity of \(\mathbf{w}^\top \mathbf{x}\) recovers an independent component. Each local maximum of non-Gaussianity over the unit sphere corresponds to one of the sources.
143.3.2 3.2 The Gaussian Degeneracy
This principle also reveals the fundamental limitation of ICA. If two or more sources are Gaussian, ICA cannot separate them. For Gaussian variables, uncorrelatedness already implies independence, so any orthogonal rotation of jointly Gaussian whitened data remains jointly Gaussian and independent. The likelihood is invariant under such rotations, and there is no preferred basis to recover. ICA therefore requires that at most one source be Gaussian. This is the mirror image of the usual statistical preference for Gaussianity: here non-Gaussianity is not a nuisance but the very signal that drives identifiability.
143.3.3 3.3 Measuring Non-Gaussianity with Kurtosis
The classical measure of non-Gaussianity is the fourth order cumulant, kurtosis. For a zero mean random variable \(y\) with unit variance,
\[ \mathrm{kurt}(y) = \mathbb{E}[y^4] - 3\big(\mathbb{E}[y^2]\big)^2 = \mathbb{E}[y^4] - 3 . \]
Kurtosis is zero for a Gaussian variable. It is positive for supergaussian or leptokurtic variables, which have heavy tails and a sharp peak, such as a Laplacian distribution, and negative for subgaussian or platykurtic variables, which are flatter than Gaussian, such as a uniform distribution. Non-Gaussianity is measured by \(|\mathrm{kurt}(y)|\) or \(\mathrm{kurt}(y)^2\). Kurtosis is computationally simple and theoretically transparent, but it is sensitive to outliers, because a few extreme samples in the tail can dominate the fourth power. This fragility motivates more robust contrast functions.
143.3.4 3.4 Negentropy and Robust Contrast Functions
A more principled measure is negentropy, grounded in information theory. Among all distributions with a fixed variance, the Gaussian has maximum differential entropy. Negentropy is defined as the entropy gap between a variable and a Gaussian \(y_{\mathrm{gauss}}\) of identical covariance:
\[ J(y) = H(y_{\mathrm{gauss}}) - H(y), \qquad H(y) = -\int p(y)\log p(y)\, dy . \]
Negentropy is always nonnegative, is zero only for a Gaussian, and is invariant under invertible linear transformations, which makes it a theoretically optimal contrast. Its drawback is that computing \(J(y)\) exactly requires the unknown density. Practical ICA therefore uses approximations of the form
\[ J(y) \approx \big( \mathbb{E}[G(y)] - \mathbb{E}[G(\nu)] \big)^2, \]
where \(\nu\) is a standard Gaussian and \(G\) is a smooth nonquadratic function chosen to grow slowly so as to resist outliers. Common choices are
\[ G_1(u) = \frac{1}{a_1}\log\cosh(a_1 u), \qquad G_2(u) = -\exp\!\left(-\frac{u^2}{2}\right), \]
with \(1 \le a_1 \le 2\). The function \(\log\cosh\) behaves like the absolute value for large arguments and so down-weights extreme samples relative to the fourth power used by kurtosis.
143.4 4. The FastICA Algorithm
143.4.1 4.1 A Fixed Point Iteration
FastICA, introduced by Hyvarinen and Oja, is a fixed point algorithm that maximizes the negentropy approximation with remarkable efficiency. It operates on whitened data \(\mathbf{z}\) and seeks a unit weight vector \(\mathbf{w}\) that maximizes the non-Gaussianity of \(\mathbf{w}^\top \mathbf{z}\). Let \(g\) denote the derivative of the contrast function \(G\), and \(g'\) its second derivative. For the choices above,
\[ g_1(u) = \tanh(a_1 u), \qquad g_2(u) = u\,\exp\!\left(-\frac{u^2}{2}\right). \]
The maxima of the negentropy approximation occur at points satisfying a Lagrangian stationarity condition. Applying an approximate Newton method to that condition yields the FastICA update for a single component:
\[ \mathbf{w}^{+} \leftarrow \mathbb{E}\big[\mathbf{z}\, g(\mathbf{w}^\top \mathbf{z})\big] - \mathbb{E}\big[g'(\mathbf{w}^\top \mathbf{z})\big]\,\mathbf{w}, \]
\[ \mathbf{w} \leftarrow \frac{\mathbf{w}^{+}}{\lVert \mathbf{w}^{+} \rVert}. \]
The first expectation is the gradient term, and the second is a scalar approximation of the Hessian that makes the step a Newton step rather than a plain gradient ascent. The normalization keeps \(\mathbf{w}\) on the unit sphere, which is legitimate because the data is whitened. In practice the expectations are replaced by sample averages over the observed data.
143.4.2 4.2 Pseudocode for One Component
Input: whitened data Z (samples in columns), contrast g and its derivative g'
Initialize w to a random unit vector
repeat
wp = mean over samples of [ Z * g(w^T Z) ] - mean[ g'(w^T Z) ] * w
w = wp / norm(wp)
until |w_new dot w_old| is close to 1
return w
The convergence criterion checks whether the new and old weight vectors point in essentially the same direction, since the sign is irrelevant. Convergence is cubic for the kurtosis based contrast and at least quadratic in general, far faster than the linear convergence of gradient methods, and it requires no learning rate to tune.
143.4.3 4.3 Extracting Several Components
To recover all \(n\) sources, FastICA estimates several vectors \(\mathbf{w}_1, \ldots, \mathbf{w}_n\) and constrains them to be orthogonal in the whitened space, since distinct independent components correspond to orthogonal directions after whitening. Two schemes are common. Deflationary orthogonalization extracts components one at a time, and after each new \(\mathbf{w}_p\) is updated it is orthogonalized against the previously found vectors by a Gram-Schmidt step:
\[ \mathbf{w}_p \leftarrow \mathbf{w}_p - \sum_{j=1}^{p-1} (\mathbf{w}_p^\top \mathbf{w}_j)\, \mathbf{w}_j, \qquad \mathbf{w}_p \leftarrow \frac{\mathbf{w}_p}{\lVert \mathbf{w}_p \rVert}. \]
Symmetric orthogonalization instead updates all vectors in parallel and then orthogonalizes the whole matrix \(\mathbf{W} = (\mathbf{w}_1, \ldots, \mathbf{w}_n)^\top\) at once:
\[ \mathbf{W} \leftarrow \big(\mathbf{W}\mathbf{W}^\top\big)^{-1/2}\, \mathbf{W} . \]
The symmetric scheme avoids the accumulation of estimation errors that the sequential deflation can suffer, and it does not privilege any component over another, so it is often preferred.
143.4.4 4.4 The Complete Pipeline
A full ICA analysis proceeds in three stages. First, center the data by subtracting the mean. Second, whiten it using the eigendecomposition of the covariance matrix, which both decorrelates the components and reduces the problem to a search over orthogonal matrices. Third, run FastICA with symmetric orthogonalization to find the unmixing rotation. The recovered sources are then \(\mathbf{y} = \mathbf{W}\mathbf{z}\), and the columns of the estimated mixing matrix follow by inverting the composed transformation. Dimensionality reduction can be folded into the whitening step by retaining only the leading principal components, which is valuable when the number of mixtures exceeds the number of sources of interest.
143.5 5. Connections, Assumptions, and Practice
143.5.1 5.1 Maximum Likelihood and Infomax
The non-Gaussianity view is one of several equivalent formulations. ICA can also be derived by maximum likelihood, where one posits source densities and maximizes the likelihood of the observed mixtures over the unmixing matrix, and by the Infomax principle, which maximizes the output entropy of a nonlinear neural network. Both lead to update rules closely related to FastICA, and the nonlinearity \(g\) plays the role of the score function of the assumed source density. A further equivalent objective is the minimization of mutual information among the outputs, which is the most direct formalization of independence and connects negentropy maximization to information theory.
143.5.2 5.2 Assumptions and Limitations
The standard model carries several assumptions worth stating plainly. The mixing is linear and instantaneous, with no time delays or convolution, which is violated by realistic room acoustics and requires convolutive ICA extensions. The mixing matrix is square and invertible, so the number of sensors equals the number of sources; the underdetermined case where sources outnumber sensors needs sparse methods. At most one source may be Gaussian. The sources are assumed stationary and mutually independent. Real applications must check these assumptions or adopt extensions that relax them.
143.5.3 5.3 Applications
ICA has found wide use precisely where independent generative sources mix linearly. In biomedical signal processing it separates artifacts such as eye blinks and heartbeat from electroencephalography and magnetoencephalography recordings. In neuroimaging it identifies spatially independent networks in functional MRI data. It serves in financial time series analysis, telecommunications for separating interfering transmissions, and feature extraction for natural images, where the learned independent components resemble the oriented edge detectors found in early visual cortex. Across these domains the same mathematical core operates: exploit non-Gaussianity to undo an unknown linear mixing.
143.6 6. Summary
Independent Component Analysis solves blind source separation by recovering independent sources from their linear mixtures. Its conceptual pivot is the recognition that statistical independence is far stronger than uncorrelatedness, so that second order methods like PCA and whitening, while useful as preprocessing, cannot finish the job. The missing ingredient is non-Gaussianity, which the Central Limit Theorem links to identifiability: the most non-Gaussian linear projections of the mixtures are the original sources. Measured by kurtosis or, more robustly, by negentropy, non-Gaussianity becomes an objective that the FastICA fixed point iteration maximizes with fast and reliable convergence. Within its assumptions of linearity, a square mixing matrix, and at most one Gaussian source, ICA delivers a powerful and widely deployed tool for uncovering hidden structure in mixed signals.
143.7 References
Hyvarinen, A., and Oja, E. (2000). Independent Component Analysis: Algorithms and Applications. Neural Networks, 13(4), 411 to 430. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf
Hyvarinen, A. (1999). Fast and Robust Fixed-Point Algorithms for Independent Component Analysis. IEEE Transactions on Neural Networks, 10(3), 626 to 634. https://www.cs.helsinki.fi/u/ahyvarin/papers/TNN99new.pdf
Hyvarinen, A., Karhunen, J., and Oja, E. (2001). Independent Component Analysis. John Wiley and Sons. https://www.cs.helsinki.fi/u/ahyvarin/papers/bookfinal_ICA.pdf
Bell, A. J., and Sejnowski, T. J. (1995). An Information-Maximization Approach to Blind Separation and Blind Deconvolution. Neural Computation, 7(6), 1129 to 1159. https://www.cnl.salk.edu/~tony/ptr/bell.pdf
Comon, P. (1994). Independent Component Analysis, A New Concept? Signal Processing, 36(3), 287 to 314. https://hal.science/hal-00417283/document
Cardoso, J. F. (1998). Blind Signal Separation: Statistical Principles. Proceedings of the IEEE, 86(10), 2009 to 2025. https://www2.iap.fr/users/cardoso/papers/ProcIEEE.pdf
Stone, J. V. (2004). Independent Component Analysis: A Tutorial Introduction. MIT Press. https://mitpress.mit.edu/9780262693158/independent-component-analysis/